Performance improvement of parallel programs on a broadcast-based distributed shared memory multiprocessor by simulation

https://doi.org/10.1016/j.simpat.2007.11.015

Abstract

Due to advances in fiber optics and VLSI technology, interconnection networks that allow simultaneous broadcasts are becoming feasible. Distributed shared memory (DSM) implementations on such networks promise high performance even for small applications with small granularity. This paper, after summarizing the architecture of one such implementation, the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus), presents simple algorithms for improving the performance of parallel programs running on a SOME-Bus multiprocessor implementing cache-coherent DSM. The algorithms are based on run-time data redistribution via a dynamic page migration protocol. They combine memory access references with information on average channel utilization, average channel waiting time, number of messages in the channel queue, or short-term average channel waiting time, reported by each node and gathered by hardware monitors, to make correct decisions about the placement of shared data. Simulations with four parallel codes on a 64-processor SOME-Bus show that the algorithms yield significant performance improvements: reduced execution times, fewer remote memory accesses, lower average channel waiting times and average network latencies, and higher average channel utilizations.

Introduction

High performance computing is required for many applications, including the modeling of weather patterns, atomic structure of materials and other physical phenomena as well as image processing, simulation of integrated circuits and other applications known as “Grand Challenge” problems. Scalable architectures capable of addressing these application classes are formed by interconnecting large numbers of microprocessor-based processing nodes in order to create distributed memory multiprocessors. The effectiveness of these types of multiprocessor systems is determined by the interconnection network architecture and the programming model supported by the system.

DSM is a common programming model supported by multiprocessor systems, along with message passing and data parallel processing. The programming model is important because it affects the amount of operating system overhead involved in communication operations as well as the level of involvement required by the programmer to specify the processor interaction required by the application. The message passing paradigm requires a higher level of programmer involvement and knowledge of the details of the underlying communication subsystem in order to explicitly direct the interprocessor communication. With the message passing paradigm, the distributed nature of the memory system is fully exposed to the application programmer. The programmer needs to keep in mind where the data are, decide when to communicate with other processes, whom to communicate with, and what to communicate. This makes programming for the message passing paradigm hard, especially for large applications with complex data structures [13]. DSM offers the application programmer a model for using shared data that is identical to the one used when writing sequential programs, thereby reducing the complexity involved in developing distributed applications [8]. Therefore, many parallel applications are easier to formulate and solve using the shared memory paradigm rather than message passing.

A DSM system can be viewed as a set of nodes or clusters, with local memories, communicating over an interconnection network. On each access to shared space, it must be determined if the requested data are in the local memory, and if not, the data must be copied from remote memory. Actions are also needed when data are written in shared space to preserve the coherence of shared data.
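The per-access decision described above can be sketched as follows; the class layout, page-granularity coherence unit and callback names are illustrative assumptions, not details from the paper.

```python
class DSMNode:
    """Minimal sketch of a DSM node's shared-space access path."""

    def __init__(self, node_id, page_size=4096):
        self.node_id = node_id
        self.page_size = page_size
        self.local_pages = {}  # page number -> page contents (bytes)

    def read(self, address, fetch_remote):
        """Return data at a shared address, copying the page if it is remote."""
        page = address // self.page_size
        if page not in self.local_pages:
            # Requested data are not in local memory: copy from remote memory.
            self.local_pages[page] = fetch_remote(page)
        return self.local_pages[page][address % self.page_size]

    def write(self, address, value, invalidate_sharers, fetch_remote):
        """Write shared data, invalidating other copies to preserve coherence."""
        page = address // self.page_size
        if page not in self.local_pages:
            self.local_pages[page] = fetch_remote(page)
        invalidate_sharers(page)  # coherence action on a write to shared space
        data = bytearray(self.local_pages[page])
        data[address % self.page_size] = value
        self.local_pages[page] = bytes(data)
```

The two callbacks stand in for whatever network transactions the real system uses: `fetch_remote` models the remote page copy, and `invalidate_sharers` models the coherence action on a write.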

Software-based DSM systems [9], [19], [20] are mostly implemented on top of commercial operating systems running on networks of workstations connected by a switched network. Round trip latencies for small messages tend to be several hundred microseconds, mostly due to software delays. Use of large pages as coherence units causes additional network traffic due to false sharing; as a result, applications exhibiting large granularity show reasonable speedup, while applications with smaller granularity show little improvement. It is therefore important that interconnection networks be designed with high bisection bandwidth and low latency to provide the best possible performance in DSM systems. The addition of full hardware support allows the true benefits of DSM to become a reality. A fully hardware-supported DSM requires very little support from the operating system, and consequently the latencies experienced are much smaller. This is critical, as DSM systems typically have a large percentage of multicast traffic due to invalidation messages in write-invalidate cache coherence protocols and update messages in write-update protocols. The correct organization and design of the interconnection network becomes a critical factor in that case, especially as processors become faster and are replaced by symmetric multiprocessors (SMPs).
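The coherence traffic argument above can be made concrete with a toy message count; the function and the single-message broadcast assumption are ours, not the paper's exact protocol.

```python
def invalidation_messages(directory, page, writer, broadcast_capable):
    """Count coherence messages one write generates under write-invalidate.

    `directory` maps a page to the set of node ids holding a copy. On a
    point-to-point network each sharer needs its own invalidation message,
    whereas a broadcast-capable network (like the SOME-Bus) can invalidate
    all sharers with a single broadcast. Illustrative model only.
    """
    sharers = directory.get(page, set()) - {writer}
    if not sharers:
        return 0
    return 1 if broadcast_capable else len(sharers)
```

With many sharers per page, the gap between `len(sharers)` messages and a single broadcast is exactly the multicast traffic that motivates broadcast-based interconnects.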

The effects of interconnection network properties and data consistency protocols have been the focus of extensive research. A DSM multiprocessor based on a two-dimensional mesh is examined in [14] using a queuing network model and simulation. For large values of remote memory request probability, it is observed that the interconnection network saturates and processor utilization stays below 35%. A study of four architectures with hardware support for shared memory is reported in [6]. Significant latency is found even under optimistic assumptions, especially for the cache misses which result in traffic over the interconnection network, as evidenced by the very small network utilization in three of the architectures. A DSM implementation on a 16-node nCUBE is described in [1]. Experiments with four parallel programs show reduced performance for matrix operations on distributed data, where data transfer time is significant compared to per-node computation time. It is observed in [1] that such programs are unsuitable for DSM unless a technique can be found to reduce the communication.

High performance (and high complexity) interconnection networks have also been proposed [2]. The distributed crossbar switch hypermesh is examined in [16], where blocking probabilities and average values of message delays are calculated. Similarly, an optical implementation of hypermeshes using electrical and optical crossbars is examined in [21]. Although multiple wavelengths are used, multiple senders may use the same wavelength, requiring contention resolution.

An interconnection network which can offer an alternative to current networks relies on one-to-all broadcast, where each processor can directly communicate with any other processor. The most useful properties of such a network are high bandwidth (scaling directly with the number of processors), low latency and no arbitration delay. One implementation is the SOME-Bus [10], which can be constructed using optoelectronic devices (sources, modulators and arrays of detectors) coupled to commercial off-the-shelf electronic processors.

Although statistical simulation of the SOME-Bus multiprocessor shows promising results [10], [11], real parallel applications running on the SOME-Bus multiprocessor show poor to moderate performance, mostly due to excessive remote memory references and load imbalance among the processors [8]. Dynamic page migration [3], [21], [14] is a technique that can potentially alleviate this problem. The idea behind dynamic page migration is to collect per-node reference information for each page in memory and migrate a page to a remote node if the reference counters indicate that the remote node accesses the page more frequently than the home node. Reference [3] proposes a competitive strategy to migrate pages using special hardware support (a counter for each page frame). Using trace-driven simulation, they evaluate a few small scientific applications and obtain speedups of 5–10. For simulations of parallel application execution on multiprocessors to be valid, the synchronization primitives must be taken into account; there is no evidence in their work that this is the case. Also, static multiprocessor address traces are not representative of the real address streams when cache blocks are moved dynamically during the execution of a program. Reference [22] studies the performance improvements on Cache Coherent Non-Uniform Memory Access (CC-NUMA) systems provided by operating-system-supported dynamic migration. This kind of migration is based on information about full cache misses collected by instrumenting the operating system. Hot pages, i.e., pages that incur a large number of misses, are migrated if they are referenced primarily by one process. Their experiments show performance increases of up to 29% for some workloads on an 8-node CC-NUMA system, which is not a typical representative of today's high-performance scalable architectures.
Reference [15] presents two algorithms for moving virtual memory pages to the nodes that reference them more frequently. The purpose of this page movement is to minimize the worst-case latency incurred by remote memory accesses. Their first algorithm works on iterative parallel programs and is based on the assumption that the page reference pattern of one iteration is repeated throughout the execution of the program. The second proposed algorithm checks periodically for hot memory areas and migrates the pages with excessive remote references. Both algorithms assume compiler support for identifying hot memory areas.
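The counter-based decision underlying these migration schemes can be sketched as follows; the 2x hysteresis threshold and the function interface are illustrative assumptions in the spirit of the competitive strategy of [3], not the exact policy of any cited work.

```python
def migration_target(ref_counts, home, threshold=2.0):
    """Decide where a page should live, given per-node reference counters.

    `ref_counts` maps node id -> number of references to the page. The page
    migrates to the node that references it most often, but only when that
    node's count exceeds the home node's count by a factor of `threshold`,
    which damps ping-ponging of the page between nodes. (Illustrative
    sketch; the threshold value is an assumption.)
    """
    top = max(ref_counts, key=ref_counts.get)
    if top != home and ref_counts[top] >= threshold * ref_counts.get(home, 0):
        return top   # migrate the page to its heaviest remote user
    return home      # keep the page at its current home
```

A real implementation would keep one such counter per page frame in hardware, as [3] does, and evaluate this predicate when a counter saturates or at fixed intervals.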

References [22], [15] make their migration decisions according to memory access histograms gathered by software with support from the operating system, the compiler, or other memory management mechanisms. Using software to gather the memory access histograms relies on software instrumentation and thus incurs both time and space overhead. The code of the relevant software handlers must be modified to count remote memory accesses from each processor to each page in memory. The inserted counting code increases the latency of the handler as well as the application's execution time. Counting remote memory misses also incurs a space overhead, because the policy requires a software counter for each processor and each page. Software-assisted policy decision and policy end mechanisms (i.e., deciding when a page should be migrated and how the migration will end) are also time consuming; it has been shown in [22] that these two mechanisms create about 10% kernel overhead. The software approach to gathering memory access histograms is therefore associated with a high overhead. To avoid this problem, we propose a dynamic page migration protocol that uses simple transactions and can be implemented in hardware. The algorithms based on this protocol can make their migration decisions using information gathered by hardware monitors, with only minimal overhead and without involvement of the compiler, the user, or any system software.

In our algorithms, instead of using memory accesses alone, each node also monitors its own channel condition to determine whether it is becoming overloaded and, based on this decision, migrates hot pages to nodes that are not overloaded, dynamically and transparently modifying the data layout. In this way, incorrectly allocated data are moved to other nodes, equalizing remote memory accesses. Pages being accessed frequently by their home node are not migrated, and hence data locality is preserved. Simulations with four parallel codes on a 64-processor SOME-Bus show that the algorithms yield significant performance improvements: reduced execution times, fewer remote memory accesses, lower average channel waiting times and average network latencies, and higher average channel utilizations.
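The channel-condition-driven policy described above can be sketched as follows. The data layout, the single-hottest-page policy and the least-loaded-target choice are illustrative assumptions; the paper's algorithms differ in which channel metric they threshold.

```python
def select_migrations(channel_load, hot_page_counts, tu=0.60):
    """Pick page migrations based on each node's channel condition.

    `channel_load` maps node id -> measured channel utilization, and
    `hot_page_counts[node]` lists (page, remote_ref_count) pairs for that
    node's most remotely referenced pages. A home node whose channel
    utilization exceeds the threshold `tu` migrates its hottest page to
    the node with the least-loaded channel. Returns a list of
    (page, source_node, target_node) tuples. Sketch only.
    """
    migrations = []
    target = min(channel_load, key=channel_load.get)  # least-loaded channel
    for node, utilization in channel_load.items():
        if utilization > tu and node != target and hot_page_counts.get(node):
            # Migrate the page with the most remote references.
            page, _ = max(hot_page_counts[node], key=lambda pr: pr[1])
            migrations.append((page, node, target))
    return migrations
```

Pages referenced mainly by their home node never appear in `hot_page_counts` with large remote counts, so locality is preserved, matching the behavior described in the text.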

Section 2 summarizes the SOME-Bus multiprocessor architecture. Section 3 presents details of the dynamic page migration algorithms. Section 4 presents the simulation framework and an overhead analysis. Section 5 gives results when the algorithms are applied to the selected parallel applications.

Section snippets

The SOME-Bus architecture

The SOME-Bus incorporates optoelectronic devices into a high performance network architecture. It is a low-latency, high-bandwidth, fiber-optic interconnection network which directly connects each node to all other nodes. One of its key features is that each of N nodes has a dedicated broadcast channel operating at several GB/s, realized by a group of wavelengths in a specific fiber, and an input channel interface based on an array of N receivers which simultaneously monitors all N channels,

Motivation

Our algorithms are motivated by the fact that shared memory programs running on DSM machines may suffer from excessive remote memory references and memory load imbalance, and consequently they show poor to moderate performance. While the performance of some applications can be improved by manually optimizing the source code with respect to data placement, others that exhibit dynamically changing access patterns can only be tuned by run-time redistribution of data or computation. Our algorithms

Simulation and the simulated architecture

Since the SOME-Bus multiprocessor architecture has not yet been built in hardware, we have developed an execution-driven simulator which provides a detailed model of the processor, directory controller, cache controller, channel controller and the DSM operation of every node on the SOME-Bus. The architecture simulated in this paper has 64 nodes. Each node has a processor, a cache controller, a directory controller and a channel controller. The cache controller fills requests for data from the

Performance improvement

All simulation results are based on time units equal to one clock cycle. The performance of the algorithms on the SOME-Bus multiprocessor is evaluated in terms of the number of simulation cycles required to execute an application, channel utilizations, average channel waiting times and average network latencies. Algorithm 1 uses the utilization threshold TU = 0.60, Algorithms 2 and 3 use the waiting time threshold TW = 360, and Algorithm 4 uses the queued messages threshold TN = 12. For all migration schemes, we
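The overload triggers implied by the thresholds quoted above can be summarized in one predicate; the metric names are ours, and the exact metric each algorithm samples (e.g. long-term versus short-term average waiting time for Algorithms 2 and 3) follows the description in the abstract.

```python
def overloaded(metrics, algorithm):
    """Return True if a node's channel counts as overloaded (sketch).

    Threshold values are those quoted in the text; dictionary keys are
    illustrative. Algorithms 2 and 3 both compare a channel waiting time
    against TW (average vs. short-term average, respectively), so they
    share one branch here.
    """
    if algorithm == 1:
        return metrics["utilization"] > 0.60        # TU
    if algorithm in (2, 3):
        return metrics["avg_wait_cycles"] > 360     # TW, in clock cycles
    if algorithm == 4:
        return metrics["queued_messages"] > 12      # TN
    raise ValueError("unknown algorithm")
```

Since all times are measured in clock cycles, TW = 360 is directly comparable to the simulator's cycle-based waiting time statistics.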

Conclusions and future work

In this paper, simple algorithms that improve the performance of parallel programs running on a broadcast-based DSM multiprocessor are proposed. The algorithms are based on a dynamic page migration protocol that can be implemented in hardware. The proposed algorithms treat the target application as a black box and do not require programmer intervention, application-specific knowledge or any specific memory access pattern such as periodicity.

An execution-driven simulator was developed, which

Acknowledgment

The authors would like to thank Cukurova University Scientific Research Projects Center for supporting this work (Project No. MMF2007BAP15).

References (23)

  • Y.C. Hu, H. Lu, A.L. Cox, W. Zwaenepoel, OpenMP for networks of SMPs, in: Proc. of the 13th Int. Symp. on Parallel...