Simulation as a tool for optimizing memory accesses on NUMA machines

https://doi.org/10.1016/j.peva.2004.10.003

Abstract

Due to the inherent non-uniformity in the memory system, programmers and users of non-uniform memory access (NUMA) machines have to take special care of the memory performance of their applications. This paper discusses a variety of potential improvements with respect to cache misses, cache invalidations, and inter-node communication. This study is based on the simulation tool SIMT, which models the memory hierarchy in detail and is capable of providing complete, accurate information about all dynamic memory references. This information can be used to analyze the memory access behavior of applications and thereby forms the basis for any optimization with respect to memory accesses.

Introduction

Parallel architectures provide the potential for achieving substantially higher performance than traditional uniprocessor architectures and are therefore commonly used for high-performance computing. From an architectural point of view, these multiprocessors can be divided into two categories: tightly coupled machines with global memory and loosely coupled machines with memories distributed across processor nodes. While the former maintain a hardware-coherent global memory accessible from any processor at the same cost, the latter traditionally transfer data between processors with explicit messages. A single address space abstraction, however, can eliminate the need for explicit data partitioning and thereby simplify parallel programming. Hence, virtual shared memory systems that provide such an abstraction even on distributed memory machines are increasingly being used in modern clusters of workstations or PCs (CoWs). They allow any processor to directly reach any memory location using regular load/store operations. Such systems, however, introduce different latencies for local and remote memory accesses. This forms a new class of system architectures, called non-uniform memory access (NUMA) systems.

NUMA systems, however, suffer from an additional performance problem caused by their non-uniform memory characteristics. In such systems, any access to global memory may target either a local or a remote memory module, each with different latency properties. For many codes, this difference leads to extensive remote memory accesses, especially as the number of nodes rises, and thereby a higher percentage of remote memory operations in the overall system. Since accesses to remote memory normally exhibit significantly higher latency, this in turn imposes severe overhead on the parallel execution of programs.
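
The cost of this non-uniformity can be illustrated with a simple back-of-the-envelope model. This sketch is not taken from the paper; the latency figures and the linear-mix assumption are purely illustrative:

```python
# Illustrative model (not from the paper): average memory access time on a
# NUMA node as a linear mix of local and remote accesses.  The latencies
# below are assumed values, not measurements of any particular machine.

def effective_latency(local_ns: float, remote_ns: float, remote_fraction: float) -> float:
    """Average latency per access for a given fraction of remote references."""
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns

# With assumed latencies of 100 ns (local) and 1000 ns (remote), even a 30%
# remote-access fraction roughly quadruples the average access time.
all_local = effective_latency(100, 1000, 0.0)   # 100 ns
mixed     = effective_latency(100, 1000, 0.3)   # roughly 370 ns
```

This is why the optimizations discussed below attack both factors of the product: the total number of memory references (cache locality) and the fraction of them that go remote (memory locality).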

This work targets such performance problems with the goal of exploring potential improvements across the entire memory system, including applications, memory management, and protocols. We focus on three performance issues that are particularly critical for NUMA architectures: cache locality, cache line invalidation, and memory locality. The first two target the cache hit ratio in order to reduce the total number of memory references. This can be done by removing cache access bottlenecks and using appropriate cache coherence schemes that cause fewer invalidations. The last issue directly targets remote memory accesses and aims at reducing inter-node communication by specifying correct data distributions.

The prerequisite for all of this is precise information about runtime cache and memory access behavior. Traditionally, such information is acquired either by hardware counters or by simulation tools. The former rely on a small set of registers provided by modern microprocessors for monitoring the occurrence of specific events, yielding valuable information about the performance of critical regions in programs. This information is restricted to very specific, mostly global events, like the total number of cache misses or the number of memory accesses, and is therefore often insufficient for a comprehensive optimization. The simulation approach, on the other hand, provides the ability to acquire extensive and complete performance information, as well as to study the impact of specific optimizations.

Following the latter approach, we have developed a simulation tool called SIMT, which models the parallel execution of shared memory applications on multiprocessor architectures. SIMT is specifically designed for measuring the performance of the memory system; its core mechanisms model caches, the distributed shared memory, and the data transfer between processor nodes. Another feature of SIMT is its data collection and preprocessing mechanism in the form of a flexible monitor simulator. It can be connected to any location in the memory system and collects complete and accurate data about all memory references. Based on this monitoring mechanism, SIMT provides detailed performance data in the form of memory access histograms. This information is then used to find memory access bottlenecks, as well as potential optimizations. To enable easy analysis of the performance data, we have developed a visualization tool that shows the access behavior and patterns in relation to source code data structures. This allows users to detect access hot spots and to correct inappropriate memory allocations.
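
The idea behind a memory access histogram can be sketched in a few lines. The trace format and page size below are assumptions for illustration, not SIMT's actual monitor output:

```python
# Sketch of a per-page, per-node access histogram, as a monitor attached to
# the memory system might produce one.  The (node_id, address) trace format
# and the 4 KiB page size are assumptions, not SIMT's actual interface.
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes

def access_histogram(trace):
    """Count memory references per (page, node) pair from a reference trace."""
    hist = Counter()
    for node, addr in trace:
        hist[(addr // PAGE_SIZE, node)] += 1
    return hist

# Hypothetical trace: node 0 touches page 1 twice, node 1 once,
# and node 0 touches page 2 once.
trace = [(0, 0x1000), (0, 0x1004), (1, 0x1008), (0, 0x2000)]
hist = access_histogram(trace)
```

A visualization tool can then render such a histogram as a page-by-node matrix, making hot spots and pages dominated by a single remote node immediately visible.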

The rest of this paper is organized as follows: Section 2 briefly describes the challenges we face on NUMA systems. In Section 3, we present the simulation tool SIMT and its mechanisms for modeling the memory hierarchy. In addition, Section 3 gives an overview of the visualization tool for presenting performance data. Section 4 shows the first experimental results, together with the discussion of how different techniques influence the memory performance. In Section 5, we contrast our approach with existing work and in Section 6, the paper concludes with a short summary and a few future directions.


Performance challenges on NUMA systems

As described above, NUMA machines provide the programmer with a global virtual memory abstraction, established either purely in software or through a combination of software and hardware support. The advantage of such approaches, compared to hardware-coherent shared memory systems, is their high cost-effectiveness and scalability. However, the considerable difference in cost between accesses to different memory locations often causes poor performance of parallel applications.

A typical NUMA machine is

SIMT: an evaluation platform for multiprocessor systems

The increasing complexity of system hardware and software has led to a greater need for adequate simulation tools for evaluating parallel programs on multiple target architectures. For this purpose, many simulation systems have been developed. Prominent examples include the comprehensive SimOS [4] from Stanford University, the memory-oriented SIMICS [10] from the Swedish Institute of Computer Science, and the Wisconsin Wind Tunnel (WWT) [11] from the University of Wisconsin.

Sample memory optimization using SIMT

Based on SIMT and its visualization tool, we have performed several optimizations for various benchmarks on the NUMA memory system. First, we analyzed the cache access behavior of the applications and studied the impact of optimizations on cache locality. We then measured the impact of cache coherence protocols and studied how they influence the execution time. Furthermore, we compared different data allocation policies with the optimal data placements detected by analyzing the visualized access pattern
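
A comparison of allocation policies of this kind can be sketched as follows. The policies (first-touch vs. round-robin striping), the trace, and the two-node configuration are hypothetical illustrations, not the paper's experimental setup:

```python
# Sketch: counting remote accesses under two hypothetical page placement
# policies on a 2-node machine.  The trace of (node, page) references and
# both policies are illustrative assumptions, not the paper's benchmarks.

def remote_accesses(trace, placement):
    """Count references whose page resides on a node other than the accessor."""
    return sum(1 for node, page in trace if placement[page] != node)

# Hypothetical trace: node 1 works on page 0, node 0 works on page 1.
trace = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 1)]

# First-touch: each page is placed on the node that references it first.
first_touch = {}
for node, page in trace:
    first_touch.setdefault(page, node)

# Round-robin: pages striped across nodes regardless of who uses them.
round_robin = {0: 0, 1: 1}

print(remote_accesses(trace, first_touch))  # 0 - every access is local
print(remote_accesses(trace, round_robin))  # 5 - every access goes remote
```

On this (deliberately adversarial) trace, first-touch eliminates all remote traffic while striping makes every access remote; an access-pattern visualization is what reveals which policy suits a given application.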

Related work

Due to the considerable difference between the latencies of local and remote memory accesses, locality optimization on NUMA machines has received increasing attention, and several approaches and techniques have been proposed. These schemes can be roughly divided into three categories: (1) compiler-based optimization [9], [12], (2) automatic optimization based on data migration [18], and (3) manual optimization based on profiling [2].

Krishnamurthy and Yelick [9] describe compiler analysis and

Conclusion

In this paper, we present research work on improving the performance of the memory system on NUMA machines. Several techniques have been proposed including adaptive mechanisms for correcting cache access bottlenecks, competitive cache coherence protocols for reducing invalidations, and locality optimizations for decreasing remote accesses. These techniques have been validated using a multiprocessor simulator, which models the memory hierarchy in detail and uses specific mechanisms to provide

References (19)

  • A. Krishnamurthy et al., Analysis and optimization for shared space programs, J. Parallel Distribut. Comput. (1996)
  • J. Archibald, A cache coherence approach for large multiprocessor systems
  • D. Cortesi, Origin2000 and Onyx2 Performance Tuning and Optimization Guide (1998)
  • H. Hellwagner, A. Reinefeld (Eds.), SCI: Scalable Coherent Interface: Architecture and Software for High-performance...
  • S.A. Herrod, Using complete machine simulation to understand computer system behavior, Ph.D. Thesis, Stanford...
  • R. Hockauf, W. Karl, M. Leberecht, M. Oberhuber, M. Wagner, Exploiting spatial and temporal locality of accesses: a new...
  • IEEE Computer Society, IEEE Standard for the Scalable Coherent Interface (SCI), IEEE Std 1596-1992, IEEE 345 East 47th...
  • R. Iyer, H. Wang, L. Bhuyan, Design and Analysis of Static Memory Management Policies for CC-NUMA Multiprocessors, ...
  • P. Keleher et al., TreadMarks: distributed shared memory on standard workstations and operating systems

