ABSTRACT
Relentless technology scaling has made transistors more vulnerable to soft, or transient, errors. To keep systems robust against such errors, current error-detection techniques use different types of redundancy at the hardware or software level. A consequence of these additional protection mechanisms is that the protected systems tend to become slower. In particular, software error-detection techniques degrade performance considerably, limiting their uptake.
This paper focuses on software redundant multi-threading error detection, a compiler-based technique that makes use of redundant cores within a multi-core system to perform error checking. Implementations of this scheme feature two threads that execute almost the same code: the main thread runs the original code and the checker thread executes code to verify the correctness of the original. The main thread communicates the values that require checking to the checker thread, which uses them in its comparisons.
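The main/checker arrangement described above can be illustrated with a minimal sketch. It is not the paper's implementation: real schemes generate this code in the compiler and run the threads on separate cores, whereas this example uses Python threads and a standard-library queue purely to show the data flow. The names `main_thread`, `checker_thread`, `run`, the `fault_at` parameter, and the computation `x * x + 1` are all hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the value stream (illustrative convention)


def main_thread(data, channel, results, fault_at=None):
    """Run the 'original' computation and forward each value that needs checking."""
    for i, x in enumerate(data):
        y = x * x + 1              # stand-in for the original computation
        if i == fault_at:
            y ^= 1                 # simulate a soft error by flipping the lowest bit
        channel.put((x, y))        # communicate the value to the checker thread
        results.append(y)
    channel.put(SENTINEL)


def checker_thread(channel, errors):
    """Recompute each value redundantly and compare it with the one received."""
    while True:
        item = channel.get()
        if item is SENTINEL:
            break
        x, y = item
        if x * x + 1 != y:         # mismatch means an error was detected
            errors.append((x, y))


def run(data, fault_at=None):
    """Start the checker, run the main computation, and collect detected errors."""
    channel = queue.Queue()
    results, errors = [], []
    checker = threading.Thread(target=checker_thread, args=(channel, errors))
    checker.start()
    main_thread(data, channel, results, fault_at)
    checker.join()
    return results, errors
```

Calling `run([1, 2, 3])` returns the computed values with an empty error list, while `run([1, 2, 3], fault_at=1)` makes the checker report the corrupted pair. Note that every checked value costs one queue operation on each side, which is the kind of fine-grained communication whose cost the paper goes on to examine.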
We identify a major performance bottleneck in existing schemes: poorly performing inter-core communication and the generated code associated with it. Our study shows that this bottleneck arises because the two threads require extremely fine-grained communication, on the order of every few instructions. We alleviate it with a series of code generation optimisations at the compiler level. We propose COMET (Communication-Optimised Multi-threaded Error-detection Technique), which improves performance across the NAS parallel benchmarks by 31.4% on average compared to the state of the art, without affecting fault coverage.