Distinguishing sharing types to minimize communication in software distributed shared memory systems

doi:10.1016/S0164-1212(00)00055-8

Journal of Systems and Software

Volume 55, Issue 1, 5 November 2000, Pages 73-85

https://doi.org/10.1016/S0164-1212(00)00055-8

Abstract

Using thread migration to redistribute threads to processors is a common scheme for minimizing the communication needed to maintain data consistency in software distributed shared memory (DSM) systems. In order to minimize data-consistency communication, the number of shared pages is used to identify the pair of threads that will cause the most communication. This pair of threads is then co-located on the same node. Thread pairs sharing a given page can be classified into three types, i.e., read/read (r/r), read/write (r/w) and write/write (w/w). Depending on the memory-consistency protocol, these three types of sharing generate distinct amounts of data-consistency communication. Ignoring this factor will mispredict the amount of communication caused by cross-node sharing and lead to wrong decisions in thread migration. This paper presents a new policy called distinguishing of types sharing (DOTS) for DSM systems. The basic concept of this policy is to classify sharing among threads as r/r, r/w or w/w, each with a different weight, and then evaluate communication cost in terms of these weights. Experiments show that considering sharing types is necessary for minimizing data-consistency communication in DSM. Using DOTS for thread mapping reduces communication more than considering only the number of shared pages.

Introduction

Software distributed shared memory (DSM) (Li, 1988; Bershad and Zekauskas, 1993; Amza et al., 1996) provides an easy user interface for programmers to cluster personal computers or workstations in networks for massive computation. The threads of applications running on DSM communicate with each other via shared variables instead of message passing. Whenever the threads access shared data, data consistency is transparently maintained by the DSM mechanism. This makes programming on a distributed system easier, since programmers do not need to put much effort into data transfer. However, program performance is not as good as on systems employing explicit message passing. One of the main reasons is that DSM usually generates more network traffic than message-passing systems because of the need to maintain data consistency (Lu et al., 1995). Accordingly, minimizing data-consistency communication is an important issue in DSM systems. One solution is to use thread migration to co-locate on the same node threads that need to communicate data with each other.

Previous work (Thitikamol and Keleher, 1999a) used sharing degree, i.e., the number of shared pages, to predict the amount of communication caused by a pair of threads if they are located on different nodes. The pairs of threads that show the highest degree of mutual sharing are then placed on the same node, in the expectation of a maximum communication reduction. In addition, the communication cost of a mapping between threads and processors is estimated by cut cost, i.e., the total sharing degree of the pairs of threads on different nodes. Whether a new mapping is better for communication reduction than an old one is determined by comparing their cut costs.
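The cut-cost comparison described above can be illustrated with a minimal sketch. The function name `cut_cost`, the dictionary layout, and all sharing-degree values are assumptions made for illustration only; they are not taken from the cited work or from Fig. 1.

```python
from itertools import combinations


def cut_cost(sharing_degree, node_of):
    """Sum the sharing degree of every thread pair placed on different nodes."""
    total = 0
    for a, b in combinations(sorted(node_of), 2):
        if node_of[a] != node_of[b]:
            # Pairs that share no page contribute nothing.
            total += sharing_degree.get(frozenset((a, b)), 0)
    return total


# Hypothetical sharing degrees: number of pages shared by each thread pair.
sharing_degree = {
    frozenset(("t1", "t2")): 3,
    frozenset(("t2", "t3")): 7,
    frozenset(("t3", "t4")): 1,
    frozenset(("t1", "t4")): 2,
}

# Two candidate thread-to-node mappings with two threads per node.
old_mapping = {"t1": 0, "t2": 0, "t3": 1, "t4": 1}
new_mapping = {"t1": 1, "t2": 0, "t3": 0, "t4": 1}

if __name__ == "__main__":
    print(cut_cost(sharing_degree, old_mapping))
    print(cut_cost(sharing_degree, new_mapping))
```

Under this metric the mapping with the lower cut cost is preferred, regardless of how the shared pages are accessed.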

However, thread pairs with the highest sharing degree located on different nodes do not necessarily create the most communication. The type of operation being performed on a shared page is also an important factor in determining the amount of communication created. Considering the possible operations that can be performed concurrently by two threads on one page yields three types: read/read (r/r), read/write (r/w) and write/write (w/w). Under release consistency (Gharachorloo et al., 1990; Keleher et al., 1995), a processor can delay making data modifications by local threads visible to threads on other processors until the next synchronization call takes effect. If a pair of threads located on distinct nodes share a page r/r, their sharing induces no data-consistency communication at synchronization points. For r/w sharing, an update needs to be sent from one node to the other. For w/w sharing, updates need to be exchanged between both execution nodes. Clearly, the three types of sharing require different amounts of communication to keep the shared page consistent. Ignoring this factor will mispredict the amount of communication caused by cross-node sharing and lead to erroneous thread migration decisions.
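The three sharing types and the number of updates each one requires can be written down directly. This is only a sketch of the description above: the names `sharing_type` and `UPDATES_PER_SYNC` are hypothetical, and the counts assume one update message per writing node at each synchronization point.

```python
def sharing_type(a_writes, b_writes):
    """Classify how two threads share one page, given whether each writes it."""
    if a_writes and b_writes:
        return "w/w"
    if a_writes or b_writes:
        return "r/w"
    return "r/r"


# Updates that cross the network at a synchronization point when the two
# threads run on different nodes (assumed: one update per writing node).
UPDATES_PER_SYNC = {"r/r": 0, "r/w": 1, "w/w": 2}


def updates_needed(a_writes, b_writes):
    """Number of updates exchanged for one page shared across two nodes."""
    return UPDATES_PER_SYNC[sharing_type(a_writes, b_writes)]
```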

Assuming the computational capability of each node and the computational needs of each thread are identical, load balance is achieved by assigning the same number of threads to each node, as in the example shown in Fig. 1. However, the thread mapping of this example is not optimal for minimizing communication cost. The system first tries to map t₂ and t₃ to the same node, since the sharing degree of this thread pair is the highest, and then tries to place t₁ and t₄ on another. The cut cost of this new mapping is 6, which is less than that of the previous mapping, so the system decides that this mapping will reduce communication. In fact, however, the amount of data-consistency communication is not reduced by this thread mapping. A better mapping for this example is achieved by assigning t₁ and t₃ to one node and t₂ and t₄ to another. The main reason for the above mistaken decision is that the amount of communication caused by each type of sharing is regarded as equal. The correct thread migration decision requires consideration of the sharing type.

Based on the above discussion, we propose distinguishing of types sharing (DOTS), a policy that utilizes data sharing type when making decisions for minimizing data-consistency communication in DSM systems. The basic concept of DOTS is to classify sharing among threads as r/r, r/w or w/w, each with a different weight, and then evaluate communication cost in terms of these weights. DOTS was experimentally implemented in a DSM system called Cohesion. The fault handler of the system was extended to track the types of sharing among threads. Two new metrics, type-sharing degree and type-sharing cost, were used to identify maximum communication thread pairs, and to judge whether an alternative thread mapping would be better for communication reduction. A set of loop applications was tested on the system for performance evaluation, and will be discussed below. The experimental results confirm that distinguishing of sharing type is significant for communication minimization in DSM systems. Thus, DOTS thread mapping is more effective for reducing data-consistency communication than considering only the number of shared pages.
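A weighted cost of this kind might be sketched as follows. The exact definitions of type-sharing degree and type-sharing cost are not given in this excerpt, so the sketch assumes that the type-sharing degree of a pair is the weighted sum of its shared pages per sharing type, and that the type-sharing cost of a mapping sums this degree over pairs split across nodes. The weights 0, 1 and 2 and all page counts are illustrative assumptions, not the values used in DOTS.

```python
# Assumed weights per sharing type (illustrative only).
WEIGHTS = {"r/r": 0, "r/w": 1, "w/w": 2}
# Equal weights reduce the metric to the plain shared-page cut cost.
EQUAL = {"r/r": 1, "r/w": 1, "w/w": 1}


def type_sharing_degree(pages_by_type, weights=WEIGHTS):
    """Weighted number of pages shared by one thread pair."""
    return sum(weights[kind] * count for kind, count in pages_by_type.items())


def type_sharing_cost(pair_sharing, node_of, weights=WEIGHTS):
    """Sum the type-sharing degree of all pairs placed on different nodes."""
    total = 0
    for pair, pages_by_type in pair_sharing.items():
        a, b = tuple(pair)
        if node_of[a] != node_of[b]:
            total += type_sharing_degree(pages_by_type, weights)
    return total


# Hypothetical sharing profile: pages shared per pair, split by sharing type.
pair_sharing = {
    frozenset(("t1", "t3")): {"w/w": 4},
    frozenset(("t2", "t3")): {"r/r": 8},
    frozenset(("t2", "t4")): {"r/w": 2, "w/w": 1},
}

# Mapping chosen by page count alone: co-locates t2 and t3 (8 shared pages).
by_degree = {"t1": 0, "t4": 0, "t2": 1, "t3": 1}
# Mapping that keeps the writing pairs together instead.
by_type = {"t1": 0, "t3": 0, "t2": 1, "t4": 1}
```

With equal weights, `by_degree` appears to be the cheaper mapping in this made-up profile; with the assumed type weights, `by_type` is cheaper because the only pair it splits shares pages read-only.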

The rest of this paper is organized as follows. Section 2 gives an overview of the test bed. Section 3 analyzes data-consistency communication on Cohesion. The implementation and performance of DOTS on Cohesion are described in Sections 4 and 5, respectively. Section 6 discusses previous work related to exploiting data sharing to minimize communication in software DSM systems. Finally, Section 7 presents conclusions and future work.

Section snippets

Overview of test bed

The test bed used in this study is a page-based DSM system called Cohesion (Shieh et al., 1995), which is built on a cluster of Intel 80x86 computers connected with Ethernet. Cohesion provides a global shared memory address space and multiple memory-consistency models using sequential and eager release to maintain the shared memory consistency. The shared address space of a program running on Cohesion can typically be divided into three parts: object-based memory (migratory protocol),

Theoretical analysis

The amount of communication caused by data sharing among threads in a DSM system is dependent on the adopted memory-consistency algorithm. Therefore, this section analyzes the algorithm in terms of memory-consistency cost on Cohesion. The analysis concentrates only on the consistency cost of release memory, since most of the shared data created in user programs is allocated as release memory. On the other hand, the applications used in this paper are loop-based problems. Loop applications are

Implementation

The implementation of DOTS on Cohesion involves two things. One is deriving and tracking the sharing types. The other is using the sharing types for minimizing data-consistency communication when reallocating threads to execution processors. Since Cohesion supports access pattern tracking and dynamic scheduling, we need only modify the mechanism of access tracking and the algorithm of thread mapping to accomplish the two implementation requirements.
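The first requirement, deriving sharing types from tracked accesses, could look roughly like the sketch below. It is not Cohesion's fault handler: the class `SharingTracker`, its method names, and the idea of recording one read/write flag per thread and page are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


class SharingTracker:
    """Hypothetical per-page access record filled in by a fault handler."""

    def __init__(self):
        # page -> thread -> True if the thread has written the page
        self._access = defaultdict(dict)

    def on_fault(self, thread, page, is_write):
        """Record one access; a write is remembered across later reads."""
        wrote = self._access[page].get(thread, False)
        self._access[page][thread] = wrote or is_write

    def pair_types(self):
        """Count, for each thread pair, the pages shared under each type."""
        counts = defaultdict(lambda: defaultdict(int))
        for writers in self._access.values():
            for a, b in combinations(sorted(writers), 2):
                # Zero, one or two writers give r/r, r/w or w/w sharing.
                kind = ("r/r", "r/w", "w/w")[writers[a] + writers[b]]
                counts[frozenset((a, b))][kind] += 1
        return {pair: dict(kinds) for pair, kinds in counts.items()}
```

The per-pair counts returned by `pair_types` are the kind of input a type-aware thread-mapping algorithm would need when it evaluates candidate mappings.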

Performance

Experiments in this study were completed on eight personal computers. Each node was a 90 MHz Intel Pentium processor equipped with 32 Mbyte memory. The network used to connect all of the execution processors was a 10 Mbps Ethernet. All of the system's resources were dedicated to running benchmark applications during the experiments. The system overhead is listed in Table 2. The remainder of this section contains an introduction to the applications, our experimental results and discussion.

Related works

Current work related to exploiting data sharing for thread migration in DSM includes Lai et al. (1997) and Schuster and Shalev (1998). Their work is dedicated to developing a policy for choosing threads for migration in order to balance the workload of processors while simultaneously reducing communication. Since the access pattern of a processor is an accumulation of the access patterns of all the local threads, the access patterns of the source and destination nodes may change after thread migration.

Conclusions and future work

This paper presents a novel policy, DOTS, which improves the exploitation of data sharing for minimizing data-consistency communication in DSM systems. This policy classifies sharing among threads into r/r, r/w and w/w, each with a different weight, and then evaluates communication costs in terms of these weights. Tracking of sharing types among threads is discussed and then demonstrated during thread mapping on an experimental test-bed implementation of a set of applications. The experimental

Tyen-Yeu Liang is currently a Ph.D. candidate in the Department of Electrical Engineering at National Cheng Kung University in Taiwan. Liang received his BS degree from National Cheng Kung University in 1992 and his MS degree from National Cheng Kung University in 1994. His research interests include parallel and distributed processing, metacomputing, load balancing, scheduling, neural networks and image processing.

References (21)

Keleher, P., et al., 1995. An evaluation of software-based release consistent protocols. Journal of Parallel and Distributed Computing, Special Issue on Distributed Shared Memory.
Amza, C., et al., 1996. TreadMarks: shared memory computing on networks of workstations. IEEE Computer.
Bershad, B.N., et al., 1988. Presto: a system for object-oriented parallel programming. Software-Practice and Experience.
Bershad, B.N., Zekauskas, M.J., 1993. The Midway distributed shared memory system. In: Proceedings of the IEEE COMPCON...
Chase, J.S., Amador, F.G., Lazowska, E.D., Levy, H.M., Littlefield, R.J., 1989. The Amber system: parallel programming...
Carter, J.B., Bennett, J.K., Zwaenepoel, W., 1991. Implementation and performance of Munin. In: Proceedings of the 13th...
Friedman, R., et al., 1997. Millipede: easy parallel programming in available distributed environments. Software-Practice and Experience.
Gersho, A., Gray, R.M., 1992. Vector Quantization and Signal Compression. Kluwer Academic Publishers,...
Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A., Hennessy, J., 1990. Memory consistency and event...
Hu, W., Shi, W., Tang, Z., 1999a. JIAJIA: an SVM system based on a new cache coherence protocol. In: Proceedings of...

There are more references available in the full text version of this article.

Jyh-Cheng Ueng is currently a Ph.D. candidate in the department of Electrical Engineering at National Cheng Kung University. His research interests focus on distributed shared memory and reconfiguration. Ueng received the BS degree from National Sun Yat-Sen University in 1991 and MS degree from National Cheng Kung University in 1993.

Ce-Kuen Shieh is currently a professor in the Department of Electrical Engineering, National Cheng Kung University. He received his Ph.D., MS, and BS degrees from the Electrical Engineering Department of National Cheng Kung University. His current research interests include distributed and parallel processing, operating systems, computer networking, and compilers.

Deh-Yuan Chuang is currently a captain in the army of Taiwan. He received his BS degree and MS degree from National Cheng Kung University in 1997 and 1999, respectively. His research focuses on parallel and distributed processing and performance optimization.

Jun-Qi Lee is currently a master's student. He received his BS degree from National Cheng Kung University in 1998. He will receive his MS degree from National Cheng Kung University in July 2000. His research interests focus on parallel processing and web applications.
