ABSTRACT
Progressive technology scaling makes computing systems more susceptible to soft errors. The most critical consequence of a soft error is silent data corruption (SDC), which occurs without any warning to the user. Estimating the SDC probability of a program is the first and essential step towards designing protection mechanisms. Prior work suffers from prediction inaccuracy because its heuristic-based models fail to describe the semantics of fault propagation. We propose a novel approach, SLOGAN, which casts the prediction of SDC probability as a graph regression task. A program is represented as a dynamic dependence graph. To capture the rich semantics of fault propagation, we apply a structured graph attention network that includes node-level, graph-level, and layer-level self-attention. The attention coefficients learned at these three levels allow the importance of edges, nodes, and layers to fault propagation to be fully considered. We generate the graph embedding by weighted aggregation of the node embeddings and compute the SDC probability with a regression model. Experiments show that SLOGAN predicts SDC probability more accurately than state-of-the-art methods, at a low time cost.
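The pipeline described in the abstract can be illustrated with a minimal, untrained sketch in plain Python. It is not the SLOGAN implementation: the toy dependence graph, the two-dimensional node features, every parameter value, the GAT-style scoring function, and the sigmoid used to map the regression output to a probability are all assumptions made for illustration. It also assumes that node-level attention weights edges, graph-level attention weights nodes, and layer-level attention weights layers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, vectors):
    """Component-wise weighted sum of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

def node_level_attention(features, preds, a_src, a_dst):
    """Each node aggregates itself and the nodes it depends on,
    weighted by per-edge attention coefficients (GAT-style scoring, assumed)."""
    updated = {}
    for node, h in features.items():
        neighbours = [node] + preds.get(node, [])
        raw = [dot(a_dst, h) + dot(a_src, features[n]) for n in neighbours]
        raw = [s if s > 0 else 0.2 * s for s in raw]  # LeakyReLU
        alphas = softmax(raw)  # edge importance, sums to 1 per node
        updated[node] = weighted_sum(alphas, [features[n] for n in neighbours])
    return updated

def graph_readout(features, q):
    """Graph-level attention: weighted aggregation of node embeddings."""
    nodes = list(features)
    betas = softmax([dot(q, features[n]) for n in nodes])  # node importance
    return weighted_sum(betas, [features[n] for n in nodes])

def predict_sdc_probability(features, preds, params, num_layers=2):
    """Stack attention layers, combine per-layer graph embeddings with
    layer-level attention, then apply a linear regressor and a sigmoid."""
    h = features
    layer_embeddings = []
    for _ in range(num_layers):
        h = node_level_attention(h, preds, params["a_src"], params["a_dst"])
        layer_embeddings.append(graph_readout(h, params["q"]))
    gammas = softmax([dot(params["r"], g) for g in layer_embeddings])  # layer importance
    graph_embedding = weighted_sum(gammas, layer_embeddings)
    z = dot(params["w"], graph_embedding) + params["b"]
    return 1.0 / (1.0 + math.exp(-z))  # assumed squashing to [0, 1]

# Hypothetical three-instruction dependence chain: load -> add -> store.
FEATURES = {"load": [1.0, 0.0], "add": [0.0, 1.0], "store": [0.5, 0.5]}
PREDS = {"add": ["load"], "store": ["add"]}  # node -> nodes it depends on
PARAMS = {  # hand-picked values standing in for learned weights
    "a_src": [0.3, -0.1], "a_dst": [0.2, 0.4],
    "q": [0.5, 0.5], "r": [1.0, -1.0],
    "w": [0.8, -0.6], "b": 0.1,
}

if __name__ == "__main__":
    print(predict_sdc_probability(FEATURES, PREDS, PARAMS))
```

In a trained model the values in `PARAMS` would be learned from programs with known SDC probabilities; here they only make the data flow from dependence graph to a single probability concrete.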
Index Terms
- SLOGAN: SDC Probability Estimation Using Structured Graph Attention Network