Analyzing fault behavior of shared data in parallel applications

https://doi.org/10.1016/j.micpro.2016.03.014

Abstract

Multicore architectures are becoming the most promising computing platforms thanks to their high performance. The soft error rate in multicore systems increases as transistor sizes shrink and transistor voltages are reduced. Evaluating the impact of soft errors on parallel applications is critical for understanding their fault characteristics and for choosing fault tolerance strategies for reliable execution. In this paper, we examine the soft error vulnerabilities of shared data in parallel Java applications. To analyze the fault behavior of shared data in parallel programs, we design and implement a bytecode instrumentation based analysis and fault injection framework. We evaluate the fault behavior of shared data fields on a set of parallel applications from the NAS benchmark suite. Our experimental evaluation characterizes the data types and access patterns of the shared fields, and shows that shared data structures of parallel applications are more vulnerable to soft errors than unshared data. While error rates for unshared local data stay around 20% in our target applications, the rate for shared data exceeds 30% for some applications. We further discuss potential directions of our results and how shared data analysis can be employed to apply partial fault tolerance techniques.

Introduction

Multicore architectures yield high performance by executing multiple concurrent software processes on parallel execution units in a power- and area-efficient way [40]. Replacing a large, complex processor with several small, simple processors on a single chip has proven successful in these architectures. While performance increases through the simultaneous execution of multiple cores in a shared memory environment with low communication cost, reliability becomes an important concern because of the high error vulnerability of multicore architectures. Since multicore systems place many smaller transistors on a single chip, they become more error prone [41].

Soft errors are transient errors resulting in bit flips in memory and errors in logic circuit output [54]. They can result from particle strikes, cosmic rays, electrical noise, and other environmental effects [24], [49]. The soft error rate in multicore systems increases as transistor sizes shrink and transistor voltages are reduced [49]. The impact of a soft error on an application running on a multicore architecture differs from scenario to scenario [54]. First, the application may continue its normal execution without any effect on the resulting output. If the value corrupted by the soft error is overwritten by a subsequent write before any read, the execution does not notice the error. Alternatively, the faulty bit may never be used in the rest of the execution: when the bit flip occurs in an unused memory location, or the bit is an un-ACE bit [37] that has no effect on the program output, the application output is the same as in the fault-free execution. Second, the program may crash with an exception and exit with an error code. For instance, a memory access raises a segmentation fault if the memory address is corrupted by the soft error and can no longer be addressed by the CPU. Third, the application may hang in an infinite loop because of an erroneous value of a loop iteration variable. Finally, the application output may be erroneous even though the application terminates normally: the soft error corrupts a value used in computing the result, and the application silently produces incorrect output (silent data corruption, SDC).
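The four outcomes can be illustrated with a minimal Java sketch. It is not the framework described in this paper: the class `SoftErrorOutcomes`, the toy workload `sumEveryStep`, and the iteration budget that stands in for hang detection are all assumptions made for illustration, and a Java `ArrayIndexOutOfBoundsException` plays the role that a segmentation fault would play in native code.

```java
import java.util.function.LongSupplier;

/** Illustrative sketch only: classifies one run as one of the four outcomes. */
final class SoftErrorOutcomes {

    enum Outcome { CORRECT, CRASH, HANG, SDC }

    /** Thrown by the watchdog when the loop exceeds its iteration budget. */
    static final class HangDetected extends RuntimeException { }

    /** Single-bit fault model: flips bit `bit` (0 = least significant) of an int. */
    static int flipBit(int value, int bit) {
        return value ^ (1 << bit);
    }

    /** Toy workload (hypothetical): sums data[start], data[start + step], ... */
    static long sumEveryStep(int[] data, int start, int step) {
        long sum = 0;
        int budget = 1_000_000; // assumption: a real harness would use a timeout
        for (int i = start; i < data.length; i += step) {
            if (--budget < 0) throw new HangDetected();
            sum += data[i];
        }
        return sum;
    }

    /** Runs the workload and compares its output with the fault-free (golden) output. */
    static Outcome classify(long golden, LongSupplier run) {
        try {
            return run.getAsLong() == golden ? Outcome.CORRECT : Outcome.SDC;
        } catch (HangDetected e) {
            return Outcome.HANG;
        } catch (RuntimeException e) {
            return Outcome.CRASH;
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        long golden = sumEveryStep(data, 0, 1);

        int[] masked = data.clone();
        masked[1] = flipBit(masked[1], 3); // corrupted ...
        masked[1] = 2;                     // ... but overwritten before any read
        System.out.println(classify(golden, () -> sumEveryStep(masked, 0, 1)));            // CORRECT

        int[] corrupted = data.clone();
        corrupted[2] = flipBit(corrupted[2], 4); // value that feeds the result
        System.out.println(classify(golden, () -> sumEveryStep(corrupted, 0, 1)));         // SDC

        // Corrupted index (sign bit flipped) and corrupted loop step (1 becomes 0).
        System.out.println(classify(golden, () -> sumEveryStep(data, flipBit(0, 31), 1))); // CRASH
        System.out.println(classify(golden, () -> sumEveryStep(data, 0, flipBit(1, 0))));  // HANG
    }
}
```

The same bit flip model yields a different outcome depending only on which value it hits, which is why outcome rates have to be measured per data category rather than assumed.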

For shared-memory architectures, the effect of soft errors may become more serious because faults propagate through the shared data structures of parallel applications [18]. Fig. 1 presents the calculation deviations of the CG benchmark for each thread after one bit flip in a shared field in the first iteration. CG, which is a NAS parallel benchmark program [14], calculates an estimate of the smallest eigenvalues of a sparse matrix with the conjugate gradient method. Its smallest input has 15 main iterations, and array structures are updated through those iterations by several threads. If the value of an element computed by one thread is corrupted in the first iteration (as in the Thread2 calculation in the figure), the calculations of the other threads also fail in the following iterations. In the figure, each connected part represents the deviation from the correct calculation for four threads (as labeled in the second iteration, the T1 point indicates the Thread1 deviation, the T2 point indicates the Thread2 deviation, and so on). In the first iteration, there is a large deviation in the intermediate value calculated by Thread2, which is the thread executing during fault injection. The other threads, Thread1, Thread3, and Thread4, still calculate correctly at the end of the first iteration; in the following iterations, however, the calculations of all threads deviate from their expected results. One bit flip in a shared field propagates to all thread calculations after just one iteration. The figure demonstrates the importance of shared data in a parallel application, as it is the cause of fault propagation in the execution.
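The propagation pattern can be reproduced on a toy kernel. The sketch below is not the NAS CG code and makes no claim about its numerics: it assumes a hypothetical iteration in which four workers each own a slice of one shared vector, compute a partial sum of squares, and then normalize the whole vector by the combined result. The workers are simulated sequentially so that the run is deterministic.

```java
/** Toy model (not the NAS CG kernel): four workers share one vector across iterations. */
final class SharedFaultPropagation {
    static final int WORKERS = 4;

    /** Flips one bit of the IEEE 754 encoding of a double. */
    static double flipBit(double v, int bit) {
        return Double.longBitsToDouble(Double.doubleToRawLongBits(v) ^ (1L << bit));
    }

    /**
     * Runs `iterations` rounds and returns partial[iteration][worker].
     * If faultIndex >= 0, bit `faultBit` of x[faultIndex] is flipped once,
     * before the first round.
     */
    static double[][] run(int iterations, int faultIndex, int faultBit) {
        double[] x = {1, 2, 3, 4, 5, 6, 7, 8};
        if (faultIndex >= 0) x[faultIndex] = flipBit(x[faultIndex], faultBit);
        int slice = x.length / WORKERS;
        double[][] partial = new double[iterations][WORKERS];
        for (int it = 0; it < iterations; it++) {
            double total = 0;
            for (int w = 0; w < WORKERS; w++) {
                // Each worker reads only its own slice of the shared vector ...
                for (int i = w * slice; i < (w + 1) * slice; i++) {
                    partial[it][w] += x[i] * x[i];
                }
                total += partial[it][w];
            }
            // ... but the combined result is written back to every element.
            double norm = Math.sqrt(total);
            for (int i = 0; i < x.length; i++) x[i] /= norm;
        }
        return partial;
    }

    /** Absolute deviation of a faulty run from the fault-free run. */
    static double[][] deviation(double[][] golden, double[][] faulty) {
        double[][] d = new double[golden.length][WORKERS];
        for (int it = 0; it < golden.length; it++) {
            for (int w = 0; w < WORKERS; w++) {
                d[it][w] = Math.abs(faulty[it][w] - golden[it][w]);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // Flip the lowest exponent bit of x[2] (3.0 becomes 6.0), owned by the second worker.
        double[][] dev = deviation(run(3, -1, 0), run(3, 2, 52));
        for (int it = 0; it < dev.length; it++) {
            System.out.println("iteration " + (it + 1) + ": " + java.util.Arrays.toString(dev[it]));
        }
    }
}
```

In the first round only the worker that owns the corrupted element deviates; from the second round on every worker deviates, because the corrupted contribution reached all of them through the shared normalization step.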

In parallel programs, shared fields are accessed much more frequently than local fields [38]. Since more than one thread (the same code segment, but a different execution) works with the shared data, the number of accesses is large even when the shared data is small (see Fig. 8). For instance, most threads frequently read the variables used for synchronization, since they need the status of these variables to achieve synchronized execution.
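One simple way to make this notion of sharing concrete is to record, for every field, which threads touched it and how often. The sketch below is an assumption-laden illustration, not the analysis component of the framework: the class `SharedFieldTracker`, the field names, and the rule "shared means accessed by more than one thread" are chosen here for the example, and in an instrumented program the `recordAccess` calls would be inserted automatically rather than written by hand.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical access tracker: a field counts as shared once two threads touch it. */
final class SharedFieldTracker {
    private final Map<String, Set<Thread>> accessors = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Records one read or write of `field` by the calling thread. */
    void recordAccess(String field) {
        accessors.computeIfAbsent(field, k -> ConcurrentHashMap.newKeySet())
                 .add(Thread.currentThread());
        counts.computeIfAbsent(field, k -> new LongAdder()).increment();
    }

    boolean isShared(String field) {
        Set<Thread> threads = accessors.get(field);
        return threads != null && threads.size() > 1;
    }

    long accessCount(String field) {
        LongAdder c = counts.get(field);
        return c == null ? 0 : c.sum();
    }

    /** Four workers: one access to a private field each, ten to a common flag. */
    static SharedFieldTracker demo() {
        SharedFieldTracker tracker = new SharedFieldTracker();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            final String local = "Worker" + i + ".localSum"; // illustrative field name
            workers[i] = new Thread(() -> {
                tracker.recordAccess(local);
                for (int k = 0; k < 10; k++) tracker.recordAccess("Solver.done");
            });
            workers[i].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return tracker;
    }

    public static void main(String[] args) {
        SharedFieldTracker t = demo();
        System.out.println("Solver.done shared: " + t.isShared("Solver.done")
                + ", accesses: " + t.accessCount("Solver.done"));
        System.out.println("Worker0.localSum shared: " + t.isShared("Worker0.localSum")
                + ", accesses: " + t.accessCount("Worker0.localSum"));
    }
}
```

Even in this tiny run the single shared flag collects far more accesses than any local field, since every thread polls it.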

Threads of a parallel program access shared data in different ways. One scenario, a RAW (read-after-write) dependency, occurs when one thread computes data and the others read it. For instance, one thread (Thread1) computes one element of an array, and two other threads (Thread2 and Thread3) use this value in their computations. If the computation by Thread1 is corrupted and the array element is updated with the incorrect value, the corrupted value propagates to Thread2 and Thread3 as well. In another scenario, the threads share data through synchronization variables. For instance, the master thread sets a synchronization variable when the execution reaches some point, and the worker threads check this variable repeatedly to ensure the correct execution order. Once the master thread sets the variable, all other threads can continue their execution. If a soft error sets the variable while it should still be unset, the worker threads continue prematurely and will probably produce incorrect results because of the incorrect synchronization. The opposite case, in which a soft error unsets the variable after the master thread sets it but before one or more worker threads read it, may also end in undesirable execution: since the workers need to read the set value to continue, the corrupted unset value causes them to hang, i.e., they never finish their execution. In a third scenario, all application threads consume common data in one memory location. A soft error in that memory location corrupts the computations performed by all threads, so the final application output is probably corrupted as well.
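The two synchronization-variable cases can be sketched as follows. This is an illustration under explicit assumptions, not code from the evaluated benchmarks: the class `SyncFlagFault`, the expected value 42, and the poll budget that reports a hang instead of spinning forever are all hypothetical, and the fault is emulated by writing the flag directly in a fixed order so that the outcome is deterministic.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical master/worker handshake with an emulated bit flip on the flag. */
final class SyncFlagFault {
    enum Fault { NONE, SPURIOUS_SET, SPURIOUS_UNSET }
    enum Result { OK, WRONG_RESULT, HANG }

    static final int POLL_BUDGET = 100_000; // assumption: stands in for a real hang

    /** Worker: polls the flag, then reads the value produced by the master. */
    static Result worker(AtomicBoolean ready, int[] shared, int expected) {
        for (int polls = 0; !ready.get(); polls++) {
            if (polls >= POLL_BUDGET) return Result.HANG;
            Thread.onSpinWait();
        }
        return shared[0] == expected ? Result.OK : Result.WRONG_RESULT;
    }

    static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    static Result run(Fault fault) {
        AtomicBoolean ready = new AtomicBoolean(false);
        int[] shared = {0};
        Result[] out = new Result[1];
        Thread w = new Thread(() -> out[0] = worker(ready, shared, 42));

        if (fault == Fault.SPURIOUS_SET) {
            ready.set(true);  // flag flips to "set" before the master produced data
            w.start();
            join(w);          // worker proceeds and reads the stale value
            shared[0] = 42;   // master produces too late
            return out[0];
        }
        shared[0] = 42;       // master produces the data ...
        ready.set(true);      // ... and then signals the workers
        if (fault == Fault.SPURIOUS_UNSET) {
            ready.set(false); // flag flips back before the worker has read it
        }
        w.start();
        join(w);
        return out[0];
    }

    public static void main(String[] args) {
        for (Fault f : Fault.values()) {
            System.out.println(f + " -> " + run(f));
        }
    }
}
```

A single corrupted bit in the flag is enough to turn a correct run into either a wrong result or a hang, depending only on the direction of the flip.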

Characterizing the impact of soft errors on applications plays a critical role in understanding their fault behavior and in deciding on fault tolerance strategies for reliable execution. Fault injection, a dependability validation technique based on controlled experiments that introduce faults into the system, allows us to evaluate the impact of soft errors on applications [11]. Fault injection experiments have been used in several areas to assess the fault behavior of hardware and software systems [7], [15], [17], [27], [28], [30], [31], [42], [52]. With fault injection-based analysis, one can not only evaluate the vulnerability of the overall execution but also identify the most vulnerable parts of the application. Since full redundancy-based fault tolerance techniques may impose impractical costs, designers of fault tolerant systems can use this information to guide partial redundancy techniques.

In this paper, we examine the soft error vulnerabilities of parallel Java applications running on multicore architectures. To analyze the fault behavior of shared data in parallel programs, we design and implement a high-level, bytecode instrumentation based framework for shared data analysis and fault injection experiments. Our approach differs from earlier approaches in that we observe multithreaded applications to identify shared data structures, which are critical for parallel application state [33], and examine the fault behavior of the applications when a soft error occurs in these shared data fields. The main contributions of this work are as follows:

  • We design and implement an instrumentation framework for parallel Java applications. Our high-level framework includes both a shared data analysis component to determine faulty structures, and a fault injection component, which implements a single-bit fault model for Java applications. The framework is based on Java language instrumentation (the java.lang.instrument package) and uses the portable Javassist library, so it does not require the source code of the target applications being analyzed.

  • We execute a set of parallel Java applications from the NAS [14] benchmark suite on our framework. After determining the fault parameters by analyzing the application threads, we perform fault injection experiments to evaluate the fault behavior of the target applications.

  • We present a detailed experimental analysis and evaluation results for shared Java objects. Our analysis includes field distributions, data type distributions, and access frequencies of Java fields. We also present fault injection results and discuss the effect of soft errors on the target applications by considering shared data structures.

  • Our experimental evaluation shows that shared data structures of parallel applications are more vulnerable to soft errors than unshared data. While error rates for unshared local data stay around 20% in our target applications, the rate for shared data exceeds 30% for some applications.

  • We present our observations and potential directions of our results. Our discussion reveals that partial fault tolerance techniques can make use of the shared data vulnerability analysis, and apply selective redundancy for only those more vulnerable structures.
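To make the single-bit fault model for Java fields concrete, the following sketch flips one bit of a named field at run time. It deliberately uses plain reflection instead of bytecode instrumentation, so it shows the fault model only and not how the framework injects faults; the class `FieldBitFlipper` and the `Solver` target with its fields are hypothetical.

```java
import java.lang.reflect.Field;

/** Reflection-based stand-in for a field-level single-bit fault injector. */
final class FieldBitFlipper {

    /** Flips bit `bit` of the primitive field `fieldName` declared in target's class. */
    static void flip(Object target, String fieldName, int bit) {
        try {
            Field f = target.getClass().getDeclaredField(fieldName);
            f.setAccessible(true);
            Class<?> type = f.getType();
            if (type == int.class) {
                f.setInt(target, f.getInt(target) ^ (1 << bit));
            } else if (type == long.class) {
                f.setLong(target, f.getLong(target) ^ (1L << bit));
            } else if (type == double.class) {
                long raw = Double.doubleToRawLongBits(f.getDouble(target));
                f.setDouble(target, Double.longBitsToDouble(raw ^ (1L << bit)));
            } else if (type == boolean.class) {
                f.setBoolean(target, !f.getBoolean(target)); // a boolean has one logical bit
            } else {
                throw new IllegalArgumentException("unsupported field type: " + type);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical target object with fields of different data types. */
    static final class Solver {
        int iterations = 15;
        double residual = 1.0;
        boolean done = false;
    }

    public static void main(String[] args) {
        Solver s = new Solver();
        flip(s, "iterations", 0); // 15 -> 14
        flip(s, "residual", 52);  // 1.0 -> 0.5 (lowest exponent bit)
        flip(s, "done", 0);       // false -> true
        System.out.println(s.iterations + " " + s.residual + " " + s.done);
    }
}
```

The effect of one flipped bit depends strongly on the data type and the bit position, which is one reason to report data type distributions alongside the injection results.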

The remainder of this paper is organized as follows. Section 2 explains the system model used in our work. Our bytecode instrumentation based framework details are presented in Section 3. The experimental analysis including experimental setup and results is given in Section 4; and it is followed by Section 5, where we present our observations and discussion about fault behavior of parallel Java applications. Section 6 presents the related work on fault injection based vulnerability analysis in the literature. We conclude the paper in Section 7.

Soft error outcomes

Soft errors are transient transitions of single-bit values due to external factors such as particle strikes, electrical noise, and cosmic rays [49], [54]. Recently, multi-bit transient faults have increased due to reduced dimensions and voltage scaling, resulting in upsets of multiple adjacent bits [19], [26], [55]. A soft error may impact the applications in various ways. In our fault analysis work, we consider the following possible outcomes:

  • Correct Execution: The application terminates successfully

Background: Java instrumentation

Java language instrumentation (the java.lang.instrument package), which is based on bytecode instrumentation, enables programmers to instrument programs running on the JVM [1]. When the JVM is initialized, the system class loader loads the agent class. Afterwards, the premain method of the agent class is invoked just before the main method of the target application class executes. Fig. 2 presents these components of Java instrumentation as an execution flow.
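A minimal agent built on this mechanism might look like the sketch below. It uses only the standard `java.lang.instrument` API and does not rewrite any bytecode: the class names `LoggingAgent` and `CountingTransformer` and the prefix `demo/` are assumptions made for the example, and the transformer merely counts matching classes where a real agent would modify `classfileBuffer`, for instance with Javassist.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical agent skeleton. In a real agent JAR this class would typically be
 * declared public in its own file and named in the manifest's Premain-Class attribute.
 */
final class LoggingAgent {
    /** Number of loaded classes whose name matched the configured prefix. */
    static final AtomicInteger seen = new AtomicInteger();

    /** Invoked by the JVM before the target's main when started with -javaagent. */
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new CountingTransformer(agentArgs == null ? "" : agentArgs));
    }

    /** Sees every class as it is loaded; class names use '/' as the separator. */
    static final class CountingTransformer implements ClassFileTransformer {
        private final String prefix;

        CountingTransformer(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className != null && className.startsWith(prefix)) {
                seen.incrementAndGet();
                // A real agent would return rewritten bytecode for this class here.
            }
            return null; // null tells the JVM to keep the original bytecode
        }
    }
}
```

With such an agent packaged as `agent.jar`, a target could be started as `java -javaagent:agent.jar=demo/ -cp app.jar demo.Main`, where the text after `=` is passed to `premain` as `agentArgs`; the JAR and class names in this command are placeholders.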

To alter bytecode of class files, the agent

Experimental analysis

In this section, we present our experimental setup used in our evaluations, and demonstrate the shared data extraction and fault injection results obtained from our experiments.

Observations and discussion

Redundancy, as a fault tolerance technique, is the replication of hardware and/or software components of a system with the aim of increasing reliability [44], [45], [46], [47]. Since redundancy causes performance degradation and resource consumption, replicating the whole program or its data may not be efficient or practical. Therefore, partial redundancy techniques based on selective replication have been proposed for higher performance with acceptable reliability [3], [6], [13], [15], [20]
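As one generic illustration of selective replication, a field identified as vulnerable could be stored in three copies and read through a majority vote, while all other fields stay unprotected. The sketch below shows this classic triple modular redundancy scheme; it is not a technique evaluated in this work, and the class `TriplicatedLong` and its `injectFault` hook are hypothetical.

```java
/** Hypothetical TMR wrapper for a single vulnerable long field. */
final class TriplicatedLong {
    private long a, b, c;

    TriplicatedLong(long value) {
        set(value);
    }

    /** Writes the value to all three copies. */
    void set(long value) {
        a = b = c = value;
    }

    /** Bitwise majority vote; also repairs whichever copy disagreed. */
    long get() {
        long voted = (a & b) | (a & c) | (b & c);
        a = b = c = voted;
        return voted;
    }

    /** Test hook: flips bit `bit` in copy 0, 1, or 2 to emulate a soft error. */
    void injectFault(int copy, int bit) {
        long mask = 1L << bit;
        switch (copy) {
            case 0: a ^= mask; break;
            case 1: b ^= mask; break;
            case 2: c ^= mask; break;
            default: throw new IllegalArgumentException("copy must be 0, 1, or 2");
        }
    }

    public static void main(String[] args) {
        TriplicatedLong sharedCounter = new TriplicatedLong(42);
        sharedCounter.injectFault(1, 7);         // corrupt one copy
        System.out.println(sharedCounter.get()); // prints 42: the vote masks the flip
    }
}
```

The wrapper triples the storage and adds a vote to every read, which is why restricting it to the few fields that the vulnerability analysis flags is attractive compared with replicating all data.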

Related work

There have been numerous fault injection studies based on different methodologies in the literature [4], [8], [22], [29], [52]. While low-level hardware techniques require additional hardware to introduce faults into the system, higher-level techniques rely on software components and target applications and operating systems without any extra hardware requirement.

BIFIT presents a binary instrumentation based empirical fault injection tool to analyze the soft error behavior of extreme-scale

Conclusions

In this paper, we examine the soft error vulnerabilities of parallel Java applications running on multicore architectures. We design and implement a bytecode instrumentation based framework for shared data analysis and fault injection on parallel applications. Our analysis evaluates the target application data fields by considering their sharing among application threads, and conducts fault injection experiments based on shared data information. Our experimental evaluation indicates that shared

Acknowledgment

This work was supported by Research Fund of the Marmara University. Project Number: FEN-A-100914-0333.

Isil Oz received the B.Sc. and M.Sc. degrees in computer engineering from Marmara University, Istanbul in 2004 and 2008, respectively. She received the Ph.D. degree in computer engineering at Bogazici University, Istanbul in 2013. Her research interests include computer architecture, performance and reliability of multicore systems, and fault-tolerant computing.

References (55)

  • S. Ghosh et al.

    Bytecode fault injection for java software

    J. Syst. Softw.

    (2008)
  • Api specification for package java.lang.instrument, 2015. Available at:...
  • Java virtual machine tool interface (jvm ti), 2015. Available at:...
  • L. Anghel et al.

    Evaluation of a soft error tolerance technique based on time and/or space redundancy

    Symposium on Integrated Circuits and Systems Design

    (2000)
  • J. Arlat et al.

    Fault injection for dependability validation: a methodology and some applications

    IEEE Trans. Softw. Eng.

    (1990)
  • D.M. Blough et al.

    Fimd-mpi: a tool for injecting faults into mpi applications

    International Parallel and Distributed Processing Symposium (IPDPS)

    (2000)
  • D. Borodin et al.

    Protective redundancy overhead reduction using instruction vulnerability factor

    Computing Frontier (CF)

    (2010)
  • J. Carreira et al.

    Xception: a technique for the experimental evaluation of dependability in modern computers

    IEEE Trans. Softw. Eng.

    (1998)
  • R. Chandra et al.

    A global-state-triggered fault injector for distributed system evaluation

    IEEE Trans. Parallel Distrib. Syst.

    (2004)
  • S. Chiba

    Load-time structural reflection in java

    European Conference on Object-Oriented Programming (ECOOP)

    (2000)
  • S. Chiba et al.

    An easy-to-use toolkit for efficient java bytecode translators

    International Conference on Generative Programming and Component Engineering (GPCE)

    (2003)
  • J.A. Clark et al.

    Fault injection: a method for validating computer-system dependability

    J. Comput.

    (1995)
  • T.J. Dell

    A white paper on the benefits of chipkill-correct ecc for pc server main memory

    IBM Microelectronics Division

    (1997)
  • S. Feng et al.

    Shoestring: probabilistic soft error reliability on the cheap

    Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems

    (2010)
  • M.A. Frumkin et al.

    Implementation of the nas parallel benchmarks in java

    Technical Report, NAS-02-009, Ames Research Center, Moffett Field, CA, USA

    (2002)
  • P. Gawkowski et al.

    Improving fault handling software techniques

    IEEE On-Line Testing Symposium

    (2010)
  • C. Giuffrida et al.

    Edfi: a dependable fault injection tool for dependability benchmarking experiments

    IEEE 19th Pacific Rim International Symposium on Dependable Computing (PRDC)

    (2013)
  • B.T. Gold et al.

    The granularity of soft-error containment in shared-memory multiprocessors

    Workshop on System Effects of Logic Soft Errors (SELSE)

    (2006)
  • B.T. Gold et al.

    Mitigating multi-bit soft errors in l1 caches using last-store prediction

    Federated Comput. Res. Conf.

    (2007)
  • O. Goloubeva et al.

    Soft-error detection using control flow assertions

    International Symposium on Defect and Fault Tolerance in VLSI Systems

    (2003)
  • M.A. Gomaa et al.

    Opportunistic transient-fault detection

    Proceedings of International Symposium on Computer Architecture (ISCA)

    (2005)
  • S. Han et al.

    Doctor: an integrated software fault injection environment

    Computer Performance and Dependability Symposium

    (1995)
  • S.K.S. Hari et al.

    mswat: low-cost hardware fault detection and diagnosis for multicore systems

    Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

    (2009)
  • J.P. Hayes et al.

    An analysis framework for transient-error tolerance

    IEEE VLSI Test Symposium (VTS)

    (2007)
  • W. Hoarau et al.

    Fault injection in distributed java applications

    International Parallel and Distributed Processing Symposium (IPDPS)

    (2006)
  • E. Ibe et al.

    Impact of scaling on neutron-induced soft error in srams from a 250 nm to a 22 nm design rule

    IEEE Trans. Electron. Devices

    (2010)
  • G. Jacques-Silva et al.

    Fiona: a fault injector for dependability evaluation of java-based network applications

    International Symposium on Network Computing and Applications (NCA)

    (2004)