1 Introduction

Exascale systems are expected to exhibit a hybrid architecture. Even contemporary systems are clusters of shared memory nodes. On such systems several levels of parallelism exist, e.g., the node level, the core level, and the SIMD level. In this paper we consider a thread to be the smallest execution element of a parallel program. A process consists of a number of threads, and each thread is able to call distributed synchronization and communication functions. A hybrid program in turn consists of a set of such processes.

Hybrid programs raise new challenges for debugging and correctness tools. Consider two processes each executing a barrier call twice (Fig. 1a). A tool analyzing the execution traces of the two processes can enumerate the barrier calls of each process and by this means compute the matching barrier calls. Identifying the relation between barrier calls becomes difficult in the presence of a hybrid parallel execution (Fig. 1b). Let us assume process 1 consists of two threads, each executing the barrier once. Thread 2 sends a message to thread 1 in between the two barrier executions. Thread 1 waits for that message before it executes its barrier. Thus the execution order of the two barriers is determined. However, in order to compute the order it is necessary to take into account the point-to-point synchronization between thread 2 and thread 1. Without that point-to-point synchronization a synchronization race would arise: it would be undetermined whether the first barrier call of process 2 matches the barrier call of thread 1 or of thread 2. In practice a concurrent call to the same barrier is often forbidden (e.g. by MPI or GASPI [1, 7]). Due to its non-deterministic nature such an error could cause an untimely program abort. The other side of the problem is illustrated in Fig. 1c. In this case two point-to-point and one collective synchronization occur. Due to the barrier, the first wait at process 2 waits for the post of thread 1, leading to a determined execution order again. But it is also necessary to take the collective synchronization into account in order to compute the order of the point-to-point synchronizations. The conclusion is that an algorithm computing the order of events from a hybrid parallel program trace cannot handle point-to-point and collective synchronization in two independent steps. Only a consolidated computation of both types of synchronization can yield a task graph that represents the guaranteed ordering of events.

Fig. 1. Interaction of collective and point-to-point synchronization in hybrid parallel program executions

In the following we introduce a model that can be used to describe both point-to-point and collective synchronization. Based on that model we formally explore how races can be detected and how a task graph of a given program trace can be constructed efficiently. Our work is novel as it unifies the handling of point-to-point and collective synchronizations. The major result of our work is an algorithm that analyzes the synchronization operations in the trace of an application's execution and computes its guaranteed orderings. The algorithm requires \(\mathcal {O}(|T|^{2})\) time, where |T| is the number of traced synchronization operations. The algorithm reports synchronization races, which are sequences of synchronization operations leading to non-deterministic program behavior. In addition we present an important optimization, which decreases the time complexity of the algorithm to sub-quadratic and makes it highly scalable. We have implemented and evaluated our concept as a tool capable of analyzing hybrid GASPI/OpenMP/Pthreads programs. The task graphs generated by the tool visualize the guaranteed synchronization relations among the threads and processes.

2 Model

We derive our model from the classic point-to-point or event-style synchronization model [6] and extend it so that it can handle collective synchronization as well. The basic concept is the event. An event has two states: posted and cleared. In the classic model three operations can be performed on an event: POST sets the state of the event to posted; WAIT suspends the calling thread until the state of the event is posted; and CLEAR sets the state of the event to cleared.

Typically, point-to-point synchronizations use simple flags as events. These flags are shared among the threads of a process. Events of collective primitives are handled similarly. Every executing element (i.e. a thread or a process) participating in a collective has its own event. A thread being part of an OpenMP barrier has a thread-local event for that barrier. A process participating in an MPI barrier shares the corresponding event among its threads. When a blocking collective is entered by a thread, a POST operation is performed on the corresponding event first. Afterward, a WAIT operation waits until all participating executing elements have entered the collective and set their respective events to posted. Finally, a CLEAR operation is performed before the execution returns from the collective.

In a blocking collective the three primitives POST, WAIT and CLEAR are tied together and executed in that order. In a non-blocking collective (e.g. a split-phase barrier [4]) the POST operation is moved to a dedicated enter routine (e.g. upc_notify). WAIT and CLEAR remain tied together in one routine (e.g. upc_wait).

The coupling of WAIT and CLEAR is important. In our model it is not only used for collective but also for point-to-point synchronization. Thus we reduce the classic model to two principal operations:

  • post(e) or P: sets the state of the event e to posted.

  • wait(e) or W: suspends the executing thread until the state of the event e is posted. If e belongs to a collective, then W waits until all participating elements have set their respective events to posted. Upon exit, the state of e is set to cleared.

Performing WAIT and CLEAR in one operation is a common practice. It is used on a regular basis in collective synchronization. Another example is the GASPI standard, whose gaspi_notify_reset function resembles the WAIT,CLEAR sequence. This function resets an event and returns its former state. A caller can choose the further execution path based on the returned value.
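To make the reduced model concrete, the sketch below shows one possible realization with per-element flags. It is purely illustrative: the class name, the locking scheme, and the member indexing are our assumptions, and it is a model illustration rather than a reusable production barrier.

  // Sketch of the two-operation model (post / wait-with-implicit-clear).
  #include <condition_variable>
  #include <cstddef>
  #include <mutex>
  #include <vector>

  class Event {
  public:
    explicit Event(std::size_t members = 1) : posted_(members, 0) {}

    // post(e): set the calling element's flag to "posted".
    void post(std::size_t me = 0) {
      std::lock_guard<std::mutex> lock(m_);
      posted_[me] = 1;
      cv_.notify_all();
    }

    // wait(e): block until all participating elements have posted, then set
    // the calling element's flag back to "cleared" (WAIT and CLEAR coupled).
    void wait(std::size_t me = 0) {
      std::unique_lock<std::mutex> lock(m_);
      cv_.wait(lock, [this] {
        for (char p : posted_) if (!p) return false;
        return true;
      });
      posted_[me] = 0;   // implicit CLEAR on exit
    }

  private:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<char> posted_;   // one flag per participating element
  };

A point-to-point flag is simply the case members == 1: the producer calls post() and the consumer calls wait(). A split-phase barrier maps onto the same interface by placing post() in the enter routine and wait() in the completion routine.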

A program execution \(\mathcal {P}= \langle E, {\prec }\rangle \) represents a particular execution of a parallel program. E is a finite set of tasks and \({\prec }\) is the happens-before relation defined over E [9]. \(\mathcal {P}\) constitutes a directed acyclic graph with E being the nodes and \({\prec }\) being the edges. We take a program trace as input, which represents a partial task graph \(\mathcal {P}^{T} = \left\langle E, {\prec }^{T} \right\rangle \). A task in E can be either a post(e) or a wait(e) operation. The event e is part of the input and contains information about the synchronization type. The \({\prec }^{T}\) relation denotes the execution order of the tasks within a thread. It is implicitly given by the input trace. The challenge is to compute the \({\prec }^{S}\) relations, which are induced among threads by the synchronization tasks. If this computation leads to a uniquely determined program execution \(\mathcal {P}= \left\langle E, {\prec }^{T} \cup {\prec }^{S} \right\rangle \), then the input trace \(\mathcal {P}^{T}\) is free of synchronization races.
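For readers who prefer code over notation, a possible in-memory representation of a traced execution is sketched below. The type and field names are assumptions for illustration; the paper does not prescribe concrete data structures.

  #include <cstddef>
  #include <vector>

  enum class TaskKind { Post, Wait };

  struct Task {
    TaskKind kind;                   // post(e) or wait(e)
    std::size_t thread;              // executing thread of the task
    std::size_t event;               // event id; carries the synchronization type in the tool
    std::vector<std::size_t> succ;   // outgoing happens-before edges: the per-thread trace
                                     // order (the T relation) plus the S relations computed later
  };

  // E is simply the set of all tasks; the trace supplies the per-thread order,
  // and the replay algorithm of Sect. 4 adds the S edges among threads.
  using ProgramExecution = std::vector<Task>;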

3 Synchronization Races

Parallel programs can exhibit various forms of non-deterministic behavior, caused by race conditions at different levels. The most fundamental race conditions are data races, which lead to value non-determinacy. Data races are generally considered a programming error. However, there are also benign and even intended data races, for example those used to implement synchronization operations.

Static non-determinacy is a property of the program control flow, which is typically intended and built into the source code. Examples are programs whose threads adjust their execution according to the content of received messages (where content may refer to the sender, the message type, or the actual data). Stencil codes are representative: halos are processed in the order in which they are received from neighboring threads. Another form of non-determinacy is mutual exclusion, where two or more synchronization operations intentionally race toward the acquisition of the same resource. Unlike point-to-point synchronization, mutual exclusion does not establish directed synchronization relations.

Our notion of synchronization races leads to a form of non-determinacy that conceptually differs from the other forms. A synchronization race can only occur among point-to-point synchronization operations accessing the same event. In Fig. 2a process 2 issues a P operation, but it is unclear whether thread 1, thread 2, or both will perceive the posted event and reset it. This depends on the point in time at which the execution of thread 1 and thread 2 reaches the respective W operations. Figure 2b depicts a race of two posts toward the same wait. If process 1 has entered the wait operation before process 2 executes \(P_{1}\), then process 1 can proceed after \(P_{1}\), and eventually the state of the event is posted after the execution of \(P_{2}\). However, if process 1 doesn't enter W before process 2 has executed \(P_{2}\), then the state of the event is eventually cleared. Figure 2c is an extension of Fig. 2b. At first glance the execution order seems well defined, since \(P_{1} {\prec }W_{1}\) and \(P_{2} {\prec }W_{2}\). But if process 2 has executed \(P_{1}\) and \(P_{2}\) before process 1 enters \(W_{1}\), then the second post gets lost and process 1 will be stuck in the second wait. This may lead to an unpredictable deadlock.

We formally define a synchronization race as a specific global program state. A global program state can be seen as a frontier drawn across all threads in between tasks of a task graph [3]. All tasks before the frontier have already been executed. Tasks immediately after the frontier are just about to be executed. We call such tasks active. A consistent global state is an execution point at which all threads could have simultaneously arrived.

Fig. 2. Different types of synchronization races

Definition 1

A synchronization race exists in a program execution \(\mathcal {P}\), iff a consistent global state exists such that a wait task on an event e is active and

  1. another wait task on e is active, or

  2. at least two post tasks on e exist before the frontier and none of them is connected to a wait task before the frontier.

A frontier of a consistent global state can only be crossed by arrows pointing in the direction of the program execution. Thus a task after a consistent frontier can never happen before a task before the frontier. Figure 3a resembles Fig. 2c and illustrates the concept. The frontier belongs to a consistent global state – all arrows cross the frontier onward. This case constitutes a synchronization race by Definition 1: \(W_{1}\) is active, \(P_{1}\) and \(P_{2}\) are before the frontier, and none of them has triggered a wait before the frontier. By contrast, the frontier in Fig. 3b is not consistent, since it is crossed by an arrow backwards from P(x) to W(x). In this case it is indeed not possible to construct a consistent frontier such that a synchronization race would be constituted according to Definition 1. Figure 3c applies the frontier concept to a collective synchronization operation in a hybrid environment. The shown frontier separates the enter and leave events (post and wait operations, respectively) of the barrier calls B1 and B2. Thus \(W_{B1}\) at thread 2 and \(W_{B2}\) at thread 1 are both active. But this frontier is not consistent, since it is crossed backwards by an arrow stemming from a point-to-point synchronization from thread 2 to thread 1. Again, a consistent frontier fulfilling all requirements of Definition 1 cannot be constructed.

The examples hint at how synchronization races can be detected: if P(e) happens after W(e), then these two tasks can never form a synchronization race.

Theorem 1

Let P be a post task triggering a wait task W; \(P_{r}\) another post task on the same event; and \(P_{r} \nprec P\). A synchronization race exists between W and \(P_{r}\), iff \(W \nprec P_{r}\).

Proof

According to Definition 1, pt.2 we try to construct a consistent frontier such that W is active and both P and \(P_{r}\) are located before the frontier.

\(\Rightarrow \): Since W is active, it lies after the frontier. If \(W {\prec }P_{r}\), then \(P_{r}\) lies after the frontier too. Thus it is not possible to construct a consistent frontier with \(P_{r}\) being located before the frontier. The conditions of Definition 1 can’t be met.

\(\Leftarrow \): Let \(Next(P_{r})\) be the task immediately following \(P_{r}\). We place the frontier between \(P_{r}\) and \(Next(P_{r})\), so that any wait triggered by \(P_{r}\) is after the frontier. Furthermore we place the frontier so that W is active. This step requires no shift of the already placed frontier segment, since \(W \nprec P_{r}\). If P is already before the frontier, the conditions of Definition 1 are met: W is active, P and \(P_{r}\) lie before the frontier and are not connected to a wait before the frontier. Otherwise we place the frontier so that P lies before it. Again, this step requires no shift of the already placed frontier segments to preserve consistency: \(W \nprec P\) since P triggers W, but also \(P_{r} \nprec P\) by assumption. Thus the conditions of Definition 1 are met again.    \(\square \)

Definition 1 requires that the sequence of wait operations on a particular event be totally ordered in a race-free task graph. Theorem 1 reveals how we can check this property: whenever a post task P is encountered, it is checked against the last wait task W on the same event that has been triggered. If \(W \nprec P\), then a synchronization race has been found.
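This check can be expressed compactly. In the sketch below, previous_wait is the last triggered wait on the event and happens_before stands for a reachability test on the task graph built so far; both names are assumptions for illustration.

  #include <cstddef>
  #include <functional>
  #include <optional>

  // Race check derived from Theorem 1: when a post task P on event e is
  // replayed, it is compared against the last triggered wait W on e.
  bool post_introduces_race(std::optional<std::size_t> previous_wait,   // last triggered wait on e
                            std::size_t post_task,                      // the post P being replayed
                            const std::function<bool(std::size_t, std::size_t)>& happens_before) {
    if (!previous_wait) return false;                    // no wait on e has been triggered yet
    return !happens_before(*previous_wait, post_task);   // W does not happen before P: race
  }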

Fig. 3. Frontiers of consistent (a) and non-consistent (b, c) global program states

We can also show that Definition 1 is suitable for identifying non-determinism in a program execution.

Theorem 2

If a program execution \(\mathcal {P}\) has no synchronization races, then \(\mathcal {P}\) is deterministic.

Proof

We assume a program execution \(\mathcal {P}= \langle E, {\prec }\rangle \) free of synchronization races. If \(\mathcal {P}\) is non-deterministic, then another execution \(\dot{\mathcal {P}} = \langle \dot{E}, \dot{\prec }\rangle \) with the same input could exhibit the same synchronization events and relations up to some point, after which they differ. Let W be the first wait event at which \(\mathcal {P}\) and \(\dot{\mathcal {P}}\) differ. We distinguish two cases:

  1. Let \(P_{1}\) and \(P_{2}\) be different post events, which trigger W in \(\mathcal {P}\) and \(\dot{\mathcal {P}}\) respectively. Then \(W \nprec P_{1}\), since \(P_{1}\) triggers W in \(\mathcal {P}\). In addition \(W \nprec P_{2}\) in \(\mathcal {P}\), since \(P_{2}\) triggers W in \(\dot{\mathcal {P}}\) and all events and relations before W are the same in \(\mathcal {P}\) and \(\dot{\mathcal {P}}\). Hence we can construct a consistent frontier in \(\mathcal {P}\), such that W is active and \(P_{1}\) and \(P_{2}\) are both before the frontier. W.l.o.g. we assume \(P_{1} \nprec P_{2}\) in \(\mathcal {P}\), since \(P_{1} {\prec }P_{2} \wedge P_{2} {\prec }P_{1}\) cannot hold. Then the conditions of Theorem 1 are met with \(P=P_{2}\) and \(P_{r}=P_{1}\). But this contradicts the initial assumption that \(\mathcal {P}\) is free of synchronization races.

  2. W.l.o.g. we assume that W is not triggered in \(\mathcal {P}\), but triggered in \(\dot{\mathcal {P}}\) by P. Then there is a task \(W_{x}\), which has cleared the event posted by P before W in \(\mathcal {P}\). Thus \(W \nprec W_{x}\) in \(\mathcal {P}\), since \(W_{x}\) is executed, but W is not triggered. If \(W_{x} {\prec }W\) in \(\mathcal {P}\), then \(W_{x}\) would be included in the set of events, which are the same in \(\mathcal {P}\) and \(\dot{\mathcal {P}}\). Then \(W_{x} {\prec }W\) in \(\dot{\mathcal {P}}\) and \(W_{x}\) would be triggered in \(\dot{\mathcal {P}}\) by P. But P has triggered W in \(\dot{\mathcal {P}}\) too, which is not possible if \(W_{x} {\prec }W\). Thus \(W_{x} \nprec W\) in \(\mathcal {P}\). The conditions of Definition 1, pt.1 are met. Again this contradicts the initial assumption that \(\mathcal {P}\) is free of synchronization races.    \(\square \)

Theorem 2 is taken verbatim from [12]. We have adapted the proof to our model and extended it to deal with the possibility of concurrent wait tasks in hybrid parallel programs. Theorem 2 implies that exactly one resulting task graph \(\mathcal {P}\) exists for a race-free input trace \(\mathcal {P}^{T}\). Moreover, no race-free task graph \(\mathcal {P}\) can exist for an input trace containing synchronization races.

Unlike other non-determinacies, we always consider synchronization non-determinacy a programming error. In the case covered by Theorem 1 both P and \(P_{r}\) might be executed before W. As a result one of these post events is lost, a subsequent wait might never trigger, and at least one thread never finishes. But even if superfluous post events prevent such a deadlock, no reliable happens-before relation is established. We only have \(P {\prec }W \vee P_{r} {\prec }W\), which also means that either P or \(P_{r}\) may happen after W. This behavior contradicts the notion of point-to-point synchronization, whose purpose is to create happens-before relations.

4 The Replay Algorithm

The following algorithm to analyze synchronization operations is based on a replay approach. It performs a mock-up execution of the traced input tasks. Due to Theorem 2 our algorithm can replay the tasks in any order that preserves the semantics of the synchronization primitives. During the replay the algorithm checks for the occurrence of synchronization races according to Theorem 1. If no races are found, the result is a race-free task graph \(\mathcal {P}\). This graph contains all happens-before relations induced by the traced synchronization primitives.

Listing 1 is a condensed version of our actual implementation, which demonstrates the unified handling of blocking collective and point-to-point operations. The function replay_tasks replays the traced tasks of one thread consecutively until there are no more traced events or an untriggered wait is encountered. Depending on the type of the processed task T, the variable e (line 3) denotes the flag number (point-to-point operation), the process group (GASPI collective), or the thread team (OpenMP barrier). Also depending on the type of T, the index r (line 4) denotes the particular position of the thread of T inside e. This index is always 0 for point-to-point operations, it refers to the process index for a GASPI collective, and to the thread index for an OpenMP barrier. Every event is assigned a data structure PWP. PWP.Wait stores the active wait task, PWP.PreviousWait stores the last wait task that has been triggered, and PWP.Post stores an already replayed post task that has not yet been connected to one or more wait tasks. Race conditions are checked at line 8 (Definition 1, pt. 2), at line 10 (Theorem 1), and at line 14 (Definition 1, pt. 1). Lines 18–26 handle triggered wait tasks. If all members of a synchronization operation (a point-to-point operation has only one member) have set their respective events to posted, then a happens-before relation is added from the respective post tasks to all active wait tasks (line 20). At line 25 the execution of formerly suspended threads is resumed. If the current task is a wait task, then the thread is suspended at line 27. Note, however, that by this time the thread might already have been processed further at line 25. If the current task is a post task, then the replay of the thread just proceeds (line 28). The handling of non-blocking collectives is omitted for brevity. They require special handling, since it is not possible to wait until all wait tasks of such a collective are encountered (line 18).

Listing 1. Condensed replay algorithm for point-to-point and blocking collective synchronization
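Since the listing itself is not reproduced here, the following C++ sketch re-creates the replay step from the description above. It is a rough approximation under assumed names and data layouts, not the authors' Listing 1; the line numbers cited in the text refer to the original listing and do not match this sketch, and auxiliary routines (reachability test, edge insertion, trace access) are left abstract.

  #include <cstddef>
  #include <map>
  #include <optional>
  #include <vector>

  enum class Kind { Post, Wait };

  struct Task {
    Kind kind;
    std::size_t thread;     // thread whose trace contains the task
    std::size_t event;      // flag number, process group, or thread team
    std::size_t member;     // position r of the thread inside the event (0 for point-to-point)
  };

  struct Slot {                            // one slot per participating element
    std::optional<Task> Wait;              // active, not yet triggered wait
    std::optional<Task> PreviousWait;      // last triggered wait
    std::optional<Task> Post;              // replayed post not yet connected to a wait
  };

  struct Replayer {
    std::map<std::size_t, std::vector<Slot>> PWP;         // per event: one slot per member

    // These helpers are left abstract in this sketch.
    bool happens_before(const Task& a, const Task& b);     // reachability test (DFS)
    void add_edge(const Task& from, const Task& to);       // record "from happens before to"
    void report_race(const Task& a, const Task& b);
    std::optional<Task> next_task(std::size_t thread);     // next traced task of a thread
    std::size_t member_count(std::size_t event);           // group/team size (1 for point-to-point)

    Slot& slot(const Task& t) {
      auto& members = PWP[t.event];
      if (members.empty()) members.resize(member_count(t.event));
      return members[t.member];
    }

    void replay_tasks(std::size_t thread) {
      while (auto t = next_task(thread)) {
        Slot& s = slot(*t);
        if (t->kind == Kind::Post) {
          if (s.Post) report_race(*s.Post, *t);            // Definition 1, pt. 2
          if (s.PreviousWait && !happens_before(*s.PreviousWait, *t))
            report_race(*s.PreviousWait, *t);              // Theorem 1
          s.Post = *t;
        } else {
          if (s.Wait) report_race(*s.Wait, *t);            // Definition 1, pt. 1
          s.Wait = *t;
        }

        std::vector<Slot>& members = PWP[t->event];
        bool triggered = true;                             // all members posted and waiting?
        for (const Slot& m : members)
          if (!m.Post || !m.Wait) { triggered = false; break; }

        if (triggered) {
          for (const Slot& m : members)                    // every post happens before every wait
            for (const Slot& n : members)
              add_edge(*n.Post, *m.Wait);
          std::vector<std::size_t> resumed;
          for (Slot& m : members) {
            m.PreviousWait = m.Wait;
            m.Wait.reset();
            m.Post.reset();
            resumed.push_back(m.PreviousWait->thread);
          }
          for (std::size_t other : resumed)
            if (other != thread) replay_tasks(other);      // resume formerly suspended threads
        } else if (t->kind == Kind::Wait) {
          return;                                          // untriggered wait: suspend this thread
        }
        // a post task (or a triggered wait) just proceeds with the next traced task
      }
    }
  };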

The performance-critical part of our algorithm is the reachability test at line 10, which we have implemented using depth-first search (DFS). Therefore the complexity of the algorithm is \(\mathcal {O}(|T|^{2})\), with |T| being the total number of tasks. However, we have optimized the reachability test by leveraging the fact that the replay order of the tasks is topologically sorted. Although the worst-case complexity remains \(\mathcal {O}(|T|^{2})\), in practice large portions of the search space are cut off, reducing the running time of our replay algorithm to sub-quadratic. In addition, the topological sorting helps in further analysis tasks (e.g. data race detection), which perform reachability tests too.
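One plausible realization of this cutoff (an assumption on our part, not necessarily the exact implementation) numbers the tasks in replay order and searches backward from the post task. Any node whose index is smaller than that of the wait we are looking for can be pruned, since all of its predecessors have even smaller indices.

  #include <cstddef>
  #include <vector>

  // Tasks are identified by their position in the replay (topological) order;
  // preds[i] lists the tasks that directly happen before task i.
  bool happens_before(const std::vector<std::vector<std::size_t>>& preds,
                      std::size_t wait, std::size_t post) {
    std::vector<std::size_t> stack = {post};
    std::vector<char> visited(preds.size(), 0);
    while (!stack.empty()) {
      std::size_t n = stack.back();
      stack.pop_back();
      if (n == wait) return true;
      if (visited[n] || n < wait) continue;   // cutoff: all ancestors of n have even smaller ids
      visited[n] = 1;
      for (std::size_t p : preds[n]) stack.push_back(p);
    }
    return false;
  }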

Since the replay order of tasks doesn't matter due to Theorem 2, the algorithm can easily be parallelized. The function replay_tasks can be executed in parallel for tasks of multiple threads. The access to the PWP map must be synchronized. Instead of the recursive call at line 27, a queue should be used from which analysis threads fetch tasks that are ready to be replayed.
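A possible parallel driver is sketched below; it is our assumption, not the authors' implementation. Worker threads fetch ready work from a shared queue (here thread ids rather than individual tasks, which is a design detail), every access to the PWP map is guarded by a mutex, and triggered threads are re-enqueued instead of being replayed recursively.

  #include <condition_variable>
  #include <cstddef>
  #include <mutex>
  #include <queue>

  class ReadyQueue {
  public:
    void push(std::size_t thread_id) {
      { std::lock_guard<std::mutex> lock(m_); q_.push(thread_id); }
      cv_.notify_one();
    }
    bool pop(std::size_t& thread_id) {               // returns false once drained and closed
      std::unique_lock<std::mutex> lock(m_);
      cv_.wait(lock, [this] { return !q_.empty() || closed_; });
      if (q_.empty()) return false;
      thread_id = q_.front();
      q_.pop();
      return true;
    }
    void close() {
      { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
      cv_.notify_all();
    }
  private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::size_t> q_;
    bool closed_ = false;
  };

  // Each analysis thread runs a loop like this; replay_one_thread stands for a
  // variant of replay_tasks that pushes resumed threads back onto the queue
  // instead of calling itself recursively:
  //   std::size_t id;
  //   while (ready.pop(id)) replay_one_thread(id, ready);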

5 Practical Evaluation

We have implemented the replay algorithm in a tool capable of analyzing post-mortem execution traces of hybrid programs using GASPI at the process level and OpenMP/Pthreads at the thread level. The tool combines this work with the model introduced in [8] in order to obtain task graphs of GASPI programs. The execution traces are generated by recording function enter and function leave events, their respective arguments, and return values using the dynamic binary instrumentation framework Pin [11]. Thus, the analysis does not require recompiling the source code.

With our replay algorithm we are able to generate a task graph of a GASPI program run from an execution trace. Since such a task graph contains the guaranteed happens-before relations, it reveals the logical connections among the threads. As such, our algorithm opens up a completely new perspective on a parallel program. A programmer can visualize, understand, and also easily teach the interactions of the asynchronous, weak synchronization operations exhibited by a GASPI program.

In the following figures the time line runs from top to bottom and ranks are ordered from left to right (starting with rank 0). Collective synchronization is not visualized for clarity. Figure 4 depicts a detail of a task graph visualizing a one-sided broadcast implemented as a binary tree. Rank 0 sends the data to ranks 1, 2, 4, and 8 via the asynchronous one-sided gaspi_write_notify function. After ranks 2, 4, and 8 have received the data, they redistribute it.

Fig. 4. Asynchronous one-sided broadcasting in a binary tree

Figure 5 shows two iterations of a one-dimensional halo-exchange code in a ring of 4 processes. The code uses double-buffering and switches back and forth between two data segments. A particular event e is defined by its rank r, its segment s, and its flag number f. The notify_reset nodes enclosed in the two dotted rectangles are a case of static non-determinism. In the first iteration rank 0 receives its data first from rank 3 and then from rank 1. In the second iteration the receiving order changes: now rank 0 receives its data first from rank 1 and then from rank 3. The dashed red line marks the happens-before relation between a wait operation (notify_reset) and a subsequent asynchronous post operation (issued by write_notify) on the same event (rank 1, segment 0, flag number 0). Thus the requirement imposed by Theorem 1 holds. During the construction of the task graph the replay algorithm has checked this requirement for all post/wait chains on all events. The program run doesn't contain any synchronization races. Thus, while the program itself is statically non-deterministic, the analyzed program run doesn't contain any problematic non-determinacies with respect to Theorem 2.

Fig. 5. Synchronization relations of a one-dimensional halo-exchange code (Color figure online)

The examination of the complexity of the replay algorithm is shown in Fig. 6 for the two applications described above. The diagrams depict the number of node visits (#VISITS) performed by the DFS in relation to the number of replayed synchronization tasks (#TASKS). For both use cases the topological sorting results in a linear complexity with respect to #TASKS. The gradient increases with the number of threads |t|. The halo-exchange code doesn't contain a collective operation in its main computational loop; its complexity is \(\mathcal {O}(|T| \cdot |t|)\). The binary broadcast code performs a number of collective operations. The additional edges thus introduced raise the complexity to \(\mathcal {O}(|T| \cdot |t|^{2})\). However, the influence of |t| can be mitigated by the already outlined parallelization, since more threads allow more tasks to be replayed in parallel.

Fig. 6. Actual complexity of the task graph generation

Fig. 7. Synchronization race

Figure 7 depicts the task graph of a program that sometimes got stuck. The analyzed execution trace was recorded from a successful program run. Nevertheless, our replay algorithm revealed a post/post collision and marked the corresponding nodes in the output graph. The problem was introduced by a program optimization, where a collective reduction in the initialization phase was replaced by a more efficient binary broadcast routine similar to the one depicted in Fig. 4. That routine was taken from another program, where it had worked. The problem was that the flag range used by the initialization routine overlapped with the flag range of the worker phase. The first marked notify_reset node at rank 3 was meant to receive the notification from rank 0. However, it could also receive a notification from rank 2, which already belongs to the worker phase. In such a case our tool marks the colliding notifications and connects them with the notify_reset node. When the replay algorithm finishes, it also marks still untriggered wait operations, e.g. the second notify_reset node at rank 3. With this information we could fix the bug by assigning different numbers to the flags of the initialization routine. This removed the overlap and freed the program of synchronization races.

6 Related Work

Race conditions are difficult to detect due to their irreproducible characteristics. Hence research on synchronization and concurrency has always been an important topic for the HPC community. However, to our knowledge no work has yet proposed a combined approach for the analysis of point-to-point and collective synchronization.

The problem of barrier matching has been studied for message-passing systems [18], PGAS systems [16], and shared-memory systems [10]. The analysis of split-phase barriers and data race detection has been combined in [15] for UPC programs. The problem of computing all guaranteed orderings in a program trace using the POST, WAIT, CLEAR model is NP-hard [14]. Two algorithms have been proposed to solve this problem [5]. The closest-common-ancestor algorithm works in polynomial time, but may miss some of the guaranteed orderings. The exhaustive-pairing algorithm computes the orderings accurately, but works in exponential time. For programs without CLEAR operations it is possible to construct algorithms with \(\mathcal {O}(np)\) complexity, where n is the number of events and p is the number of processes [13, 17]. However, giving up the CLEAR operation entails the problem that events are not reusable. A discussion of the consequences of the CLEAR operation can be found in [2].

An efficient algorithm to locate synchronization errors in pure MPI programs is described in [12]. While this approach does not handle collective synchronization, the theoretical background presented there is similar to our approach. Theorem 2 appears in our work in a more generalized context.

7 Conclusion

This paper makes two important contributions. First, we have extended the event-style synchronization model to collectives. By doing so we are able to handle point-to-point and collective synchronization in a unified manner. This enables us to reason about the execution order of events in hybrid parallel programs. Second, we have condensed the event-style synchronization paradigm to two operations – post and wait. Our wait operation is a concatenation of the classic WAIT and CLEAR operations. This simplification has the important effect that task graph construction is no longer NP-hard. Thus programs using our synchronization paradigm are testable for race conditions of various kinds. Our paradigm is used by the collective and the point-to-point synchronization routines defined by the GASPI standard. Even MPI_Send and MPI_Recv can be regarded as post and wait, respectively.

Our model does not require an atomic coupling of WAIT and CLEAR. A programmer could also manually perform a CLEAR after a WAIT. For instance, the OpenShmem function shmem_int_wait forces a thread to wait until an integer is no longer equal to a certain value. One could reset the respective integer to that value as soon as shmem_int_wait has returned and thus achieve the functionality required by our model. A function shmem_int_wait_and_clear would lead to programs that are implicitly testable for race conditions. That is why we think that our work should be considered whenever decisions have to be made during the design of parallel programming APIs. Point-to-point synchronization using post and wait makes reasoning about the correctness of programs easier than the POST, WAIT, CLEAR paradigm.
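As a concrete illustration of this idea, the sketch below couples the existing wait routine with a manual reset, assuming the (deprecated) OpenSHMEM C interface in which shmem_int_wait blocks until the flag no longer equals the given value; the helper wait_and_clear is hypothetical and not part of any standard.

  #include <shmem.h>

  static const int CLEARED = 0;

  void wait_and_clear(int *flag) {
    shmem_int_wait(flag, CLEARED);   // suspend until a peer has posted (flag != CLEARED)
    *flag = CLEARED;                 // manual CLEAR immediately after the WAIT returns
  }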

While we have introduced the algorithm in the context of post-mortem analysis, an adaptation to an on-the-fly approach is possible. As discussed in Sect. 3, tasks can be replayed in the order of their delivery. On-the-fly techniques can cope with much longer program runs than post-mortem techniques, since tasks can be discarded once they are evaluated. An interesting research topic is the question which tasks can be discarded so that the test of Theorem 1 (PWP[r].PreviousWait \({\prec }\) T) on line 10 of Listing 1 is not affected.