Cache simulation for irregular memory traffic on multi-core CPUs: Case study on performance models for sparse matrix–vector multiplication

https://doi.org/10.1016/j.jpdc.2020.05.020

Highlights

  • Method for estimating irregular data traffic in multi-core memory hierarchies.

  • Detailed performance modelling for bandwidth-limited computations.

  • Experiments quantifying bottlenecks of sparse matrix–vector multiplication.

Abstract

Parallel computations with irregular memory access patterns are often limited by the memory subsystems of multi-core CPUs, though it can be difficult to pinpoint and quantify performance bottlenecks precisely. We present a method for estimating volumes of data traffic caused by irregular, parallel computations on multi-core CPUs with memory hierarchies containing both private and shared caches. Further, we describe a performance model based on these estimates that applies to bandwidth-limited computations. As a case study, we consider two standard algorithms for sparse matrix–vector multiplication, a widely used, irregular kernel. Using three different multi-core CPU systems and a set of matrices that induce a range of irregular memory access patterns, we demonstrate that our cache simulation combined with the proposed performance model accurately quantifies performance bottlenecks that would not be detected using standard best- or worst-case estimates of the data traffic volume.

Introduction

Performance is a high priority in scientific computations, and so meticulous work is devoted to optimising the underlying code. During such optimisation efforts, performance models are valuable tools for directing attention towards pressure points, and indicating when optimisations are good enough and expending further effort would be unproductive. For instance, the popular Roofline model [41] bounds performance in terms of a CPU’s peak computational capacity and memory bandwidth together with an algorithm’s computational intensity. Because CPUs have hierarchical memories, the bandwidth and computational intensity can vary depending on the memory hierarchy level that is considered. Moreover, the computational intensity depends not only on parameters such as cache size, but also on the memory access pattern of the computation. Recently, more elaborate performance models have been developed for stencil codes [8], [32], [46], where they have been used to evaluate the effectiveness of spatial and temporal blocking optimisations. In these cases, the amounts of data transferred between levels of the memory hierarchy are known in advance, because memory accesses are predictable and depend only on the problem size and the order of the stencil. Unfortunately, this is not the case for irregular computations, where memory access patterns depend on data that may only be known at runtime.
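The Roofline model mentioned above can be stated in a few lines: attainable performance is the minimum of the machine's compute ceiling and the product of its memory bandwidth and the kernel's computational intensity. The sketch below uses hypothetical machine parameters purely for illustration:

```python
def roofline_bound(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable performance (GFLOP/s) under the Roofline model:
    the minimum of the compute ceiling and the bandwidth roof."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# Hypothetical machine: 500 GFLOP/s peak, 100 GB/s memory bandwidth.
# An SpMV-like intensity of 0.25 flops/byte is bandwidth-bound:
print(roofline_bound(500.0, 100.0, 0.25))   # → 25.0
# A compute-heavy kernel at 10 flops/byte hits the compute ceiling:
print(roofline_bound(500.0, 100.0, 10.0))   # → 500.0
```

Note that, as the text observes, both the bandwidth and the intensity depend on which level of the memory hierarchy is considered, so the same kernel yields a different roofline at each level.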

When faced with irregular access patterns, the typical approach is to derive estimates of the memory traffic for the worst- or best-case scenarios. These are “paper and pencil” estimates that have the advantage of being cheap to produce, requiring neither an implementation nor an actual machine to run it on. On the other hand, such estimates are crude and can in reality be far from the true data traffic volumes, thereby offering little help in understanding the performance that is actually achieved. For example, Fig. 1 shows worst- and best-case estimates for sparse matrix–vector multiplication (SpMV), a widely used computational kernel that suffers from both irregularity and low computational intensity. Due to the considerable difference between the best- and worst-case data traffic, these estimates inspire little confidence when used to evaluate whether the performance of a given kernel implementation is good enough. In this case, more accurate estimation of data traffic volumes is needed for performance validation. In general, numerous computational kernels face the same issues due to irregular memory accesses that arise through the use of sparse data structures, such as graphs or unstructured meshes.
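Such “paper and pencil” bounds are easy to write down. The sketch below gives one common formulation for CSR SpMV, assuming 8-byte values and 4-byte indices; the exact constants in the paper's Fig. 1 may differ, and the two bounds diverge precisely in how often elements of the input vector x must be fetched:

```python
def csr_spmv_traffic_bounds(m, n, nnz, value_bytes=8, index_bytes=4):
    """Best- and worst-case main-memory traffic (bytes) for CSR SpMV y = A*x,
    for an m-by-n matrix with nnz nonzeros. The matrix arrays (values,
    column indices, row pointers) and the output y are streamed once in
    either case; the bounds differ only in the traffic caused by x."""
    matrix = nnz * (value_bytes + index_bytes) + (m + 1) * index_bytes
    vectors = m * value_bytes                       # store y once
    best = matrix + vectors + n * value_bytes       # each x[j] fetched once
    worst = matrix + vectors + nnz * value_bytes    # every access to x misses
    return best, worst

# Example: a 4-by-4 matrix with 8 nonzeros.
print(csr_spmv_traffic_bounds(4, 4, 8))  # → (180, 212)
```

For matrices with many more nonzeros than columns, the gap between the two bounds grows with nnz, which is exactly why these estimates alone cannot validate observed performance.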

In this paper, we present a method for quantifying the amounts of data transferred between levels of a multi-core CPU’s memory hierarchy during irregular computations. The estimated data traffic volumes are produced by a trace-driven cache simulation that relies on a few basic assumptions and a simplified model of the memory hierarchy. Moreover, the method applies to memory hierarchies with shared caches, a common feature of contemporary multi-core CPUs, and a case that is not always addressed by existing analytical cache models [1], [12]. Because the proposed method is based on tracing a sequence of memory references, it requires some amount of computation that is likely to be at least as much as the cost of executing the kernel itself. However, the method remains applicable in cases where the actual machine in question is not available, or the data traffic cannot be quantified directly through hardware monitoring facilities, for example, because these facilities are unavailable, unreliable or the results are not easily interpreted.

Because of its importance and familiarity as an irregular computational kernel, we use SpMV to demonstrate that our cache simulation accurately quantifies the volumes of data transfers in the memory hierarchies of two Intel-based multi-core CPU systems. In turn, these data transfer volumes are used to give accurate performance predictions that are unavailable through the use of simple worst- or best-case estimates. We also give performance predictions for an AMD Epyc CPU, and explore some limitations of the proposed method using a variant of SpMV that not only includes irregular reads, but also irregular writes. Ultimately, these predictions result in a quantitative understanding of SpMV performance, which, for example, can be used to check that the observed performance of a given implementation matches our expectations, and that the implementation is free from hidden performance issues.

The remainder of this paper is organised as follows. In the next section, we describe our cache simulation approach for estimating the data traffic volumes of computations with irregular memory accesses. In Section 3, we present a performance model for bandwidth-limited computations, where the relevant data traffic volumes are used together with realistic memory and CPU cache bandwidths. Next, in Section 4, we recall standard SpMV algorithms for matrices in the compressed sparse row (CSR) and coordinate (COO) storage formats. We also review known bounds on the volume of data traffic generated by the CSR SpMV algorithm, which is later used to compare with the results of our cache simulation method. Then, in Section 5, we describe experiments that are used to validate the estimated data traffic volumes and the performance model for the studied SpMV algorithms. Finally, we briefly discuss related work in Section 6 and draw our conclusions in Section 7.

Section snippets

Quantifying data traffic for irregular, parallel computations

To estimate the data traffic volume for a given computation on a multi-core CPU system, we consider the sequence of load and store operations that would be performed by the participating CPUs. Then we simulate a cache’s behaviour using a simplified model based on the established ideal-cache model [12], which is ordinarily used in the design of cache-oblivious algorithms. We depart from the ideal-cache model in two ways. First, for practical reasons, we assume a least recently used replacement policy.
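The core of such a trace-driven simulation is small. The following is a minimal sketch, not the authors' implementation: a fully associative cache with LRU replacement processes a trace of byte addresses, and the traffic to the next level is the miss count times the line size:

```python
from collections import OrderedDict

def simulate_lru_cache(addresses, cache_lines, line_bytes=64):
    """Trace-driven simulation of a fully associative cache with least
    recently used (LRU) replacement, a practical simplification of the
    ideal-cache model. Returns the number of misses; the traffic to the
    next level of the hierarchy is misses * line_bytes."""
    cache = OrderedDict()           # cache-line tag -> None, in LRU order
    misses = 0
    for addr in addresses:
        tag = addr // line_bytes
        if tag in cache:
            cache.move_to_end(tag)  # hit: mark line most recently used
        else:
            misses += 1             # miss: fetch the line from below
            cache[tag] = None
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the LRU line
    return misses

# With a single 64-byte line: 0 misses, 8 hits (same line), then 64
# evicts it, so the final access to 0 misses again.
print(simulate_lru_cache([0, 8, 64, 0], cache_lines=1))  # → 3
```

Extending this sketch to the setting of the paper would mean interleaving per-core traces through private caches and a shared last-level cache; the per-level miss counts then yield the data traffic volumes between adjacent levels.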

A performance model based on data traffic and bandwidth

In this section, we describe a performance model for computations that are limited by cache or memory bandwidth by tying the execution time to the relevant data traffic volumes between levels of a memory hierarchy. In the following model, we assume that computation and memory accesses overlap, and, moreover, that the dominant cost is due to memory accesses, so that computations can be neglected. In addition, data must be transferred between adjacent memory hierarchy levels.
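Under these assumptions, one simple way to turn per-level traffic volumes into a time estimate, assuming transfers at the different levels fully overlap, is to let the slowest level dominate. This is a sketch of that idea, not necessarily the exact formulation used in the paper:

```python
def predicted_time(traffic_bytes, bandwidth_bytes_per_s):
    """Predicted execution time for a bandwidth-limited kernel: with fully
    overlapped transfers, the bottleneck level of the memory hierarchy
    dominates. Both arguments map a level name to bytes moved and
    sustained bandwidth, respectively."""
    return max(traffic_bytes[lvl] / bandwidth_bytes_per_s[lvl]
               for lvl in traffic_bytes)

# Hypothetical example: 2 GB through L3 at 200 GB/s, 1.2 GB through
# DRAM at 100 GB/s; DRAM is the bottleneck.
t = predicted_time({"L3": 2e9, "DRAM": 1.2e9},
                   {"L3": 200e9, "DRAM": 100e9})
print(t)  # → 0.012 (seconds)
```

Feeding the traffic volumes estimated by the cache simulation into such a model, together with measured bandwidths, yields the performance predictions evaluated later in the paper.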

Sparse matrix–vector multiplication

The multiplication of a sparse matrix with a dense vector, or SpMV, is a prime example of an irregular, parallel computation. It is also a fundamental computational kernel that appears in numerous scientific applications. For example, SpMV is performed repeatedly in iterative methods for solving sparse linear systems, such as Krylov subspace methods [30]. The efficiency of these methods often hinges on the SpMV computations that are required during each iteration, but it is well known that SpMV is typically limited by memory bandwidth.
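A generic serial CSR SpMV kernel (the paper's Algorithm 2 refers to its own formulation; this is the textbook version) makes the irregularity explicit: the matrix arrays and the output are traversed sequentially, while the gather from x is data-dependent:

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """y = A*x for A in compressed sparse row (CSR) format. Accesses to
    values, col_idx and y are sequential; the gather x[col_idx[k]] is
    the irregular part whose traffic depends on the sparsity pattern."""
    m = len(row_ptr) - 1
    y = [0.0] * m
    for i in range(m):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y[i] = s
    return y

# A = [[2, 0], [1, 3]], x = [1, 2]:
print(csr_spmv([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 2.0]))
# → [2.0, 7.0]
```

With two floating-point operations per nonzero against roughly a dozen bytes of matrix data alone, the kernel's low computational intensity places it firmly in the bandwidth-limited regime of the performance model.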

Numerical experiments

In this section, we describe experiments that test the accuracy of data traffic estimates obtained with the cache simulation method described in Section 2, focusing on the CSR-based SpMV kernel in Algorithm 2. Next, we use the data traffic estimates for CSR SpMV to evaluate the performance model from Section 3. Finally, we also evaluate the data traffic estimates for the COO SpMV kernel in Algorithm 3.
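For comparison with the CSR kernel, a generic serial COO SpMV (again a textbook sketch, not necessarily identical to the paper's Algorithm 3) shows why this variant stresses the method further: both the gather from x and the scatter-update into y are irregular:

```python
def coo_spmv(rows, cols, values, x, m):
    """y = A*x for A in coordinate (COO) format, where the k-th nonzero is
    A[rows[k], cols[k]] = values[k]. Unlike CSR, both the read of x[c]
    and the read-modify-write of y[r] are irregular accesses."""
    y = [0.0] * m
    for r, c, v in zip(rows, cols, values):
        y[r] += v * x[c]
    return y

# Same matrix as before, A = [[2, 0], [1, 3]], x = [1, 2]:
print(coo_spmv([0, 1, 1], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 2.0], 2))
# → [2.0, 7.0]
```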

Related work

The cache simulation method we have presented builds on analytical cache models [1] and trace-driven memory simulation [35], both of which are well known methods for studying cache performance. In their survey, Uhlig and Mudge [35] compare a number of advanced tools for trace-driven memory simulation that cope with various cache configurations, such as associativity and replacement policies. Our approach is to develop a model that is as simple as possible, but accurate enough to diagnose performance bottlenecks.

Conclusion

The performance of irregular, bandwidth-limited computations, such as SpMV, is dictated by data transfers between levels of a CPU’s memory hierarchy. Even though it is fairly easy to acquire worst- and best-case estimates of the data traffic, these estimates are not always sufficient for locating and quantifying bottlenecks because a precise characterisation of irregular data traffic is missing. We have presented a cache simulation method that accurately quantifies data traffic in a multi-core memory hierarchy.

CRediT authorship contribution statement

James D. Trotter: Conceptualization, Methodology, Software, Investigation, Writing - original draft, Writing - review & editing. Johannes Langguth: Conceptualization, Methodology, Validation, Supervision, Writing - review & editing. Xing Cai: Conceptualization, Methodology, Validation, Funding acquisition, Project administration, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Research Council of Norway under contract 251186. Also, the research presented in this paper has benefited from the Experimental Infrastructure for Exploration of Exascale Computing (eX3), which is financially supported by the Research Council of Norway under contract 270053.

James D. Trotter is currently working towards a Ph.D. at the Simula Research Laboratory and University of Oslo, Norway. He received his B.S. degree in Computational Science and Mathematics and M.S. degree in Computational Science and Engineering from the University of Oslo in 2012. His research interests include parallel programming, high-performance computing and numerical methods for solving PDEs.

References (46)

  • Çatalyürek, Ü.V., et al.

    Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication

    IEEE Trans. Parallel Distrib. Syst.

    (1999)
  • de la Cruz, R., et al.

    Modeling stencil computations on modern HPC architectures
  • Davis, T., et al.

    The University of Florida Sparse Matrix Collection

    ACM Trans. Math. Softw.

    (2011)
  • Eranian, S., et al.

    perfmon2: improving performance monitoring on Linux

    (2018)
  • Filippone, S., et al.

    Sparse matrix-vector multiplication on GPGPUs

    ACM Trans. Math. Softw.

    (2017)
  • Frigo, M., et al.

    Cache-oblivious algorithms

    ACM Trans. Algorithms

    (2012)
  • Goumas, G., et al.

    Performance evaluation of the sparse matrix-vector multiplication on modern architectures

    J. Supercomput.

    (2009)
  • Haase, G., et al.

    A Hilbert-order multiplication scheme for unstructured sparse matrices

    Int. J. Parallel Emergent Distrib. Syst.

    (2007)
  • Intel Corporation

    Intel® 64 and IA-32 Architectures Software Developer’s Manual: Volume 3 (3A, 3B, 3C & 3D): System Programming Guide

    (2017)
  • Intel Corporation

    Intel® 64 and IA-32 Architectures Optimization Reference Manual

    (2018)
  • Karsavuran, M.O., et al.

    Locality-aware parallel sparse matrix-vector and matrix-transpose-vector multiplication on many-core processors

    IEEE Trans. Parallel Distrib. Syst.

    (2016)
  • Kreutzer, M., et al.

    A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units

    SIAM J. Sci. Comput.

    (2014)
  • Langguth, J., et al.

    Scalable heterogeneous CPU-GPU computations for unstructured tetrahedral meshes

    IEEE Micro

    (2015)

Johannes Langguth is a research scientist at Simula Research Laboratory, Norway. He received his Ph.D. in Computer Science from the University of Bergen, Norway in 2011, and master’s degrees in Computer Science and Economics from the University of Bonn, Germany. After a postdoctoral appointment at ENS Lyon, France, he joined Simula in 2012. His research focuses on the design of discrete algorithms for irregular problems on parallel heterogeneous architectures, such as multi-core CPUs and GPUs, and their applications in scientific computing, graph analytics, machine learning, computational social science, and high-performance codes for cardiac electrophysiology.

Xing Cai received his Ph.D. in Scientific Computing from the Department of Informatics at the University of Oslo in 1998. In 1999, he was appointed associate professor at the University of Oslo, and he was promoted to full professor in 2008. He joined Simula at its very beginning in 2001, taking an 80% leave from his university position. His research interests include parallel programming and high-performance scientific computing on multi-core CPUs and GPUs, numerical methods for solving PDEs, and generic PDE software. He has participated in numerous PDE-related software projects, most notably as the principal developer of the Parallel Toolbox within Diffpack, which is now a commercial product marketed by InuTech, Germany.
