COMET: communication-optimised multi-threaded error-detection technique

Authors:
Konstantina Mitropoulou (University of Cambridge, UK)
Vasileios Porpodas (Intel)
Timothy M. Jones (University of Cambridge, UK)

CASES '16: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, October 2016, Article No.: 7, Pages 1–10
https://doi.org/10.1145/2968455.2968508

Published: 01 October 2016

ABSTRACT

Relentless technology scaling has made transistors more vulnerable to soft, or transient, errors. To keep systems robust against them, current error-detection techniques use different types of redundancy at the hardware or software level. A consequence of these additional protection mechanisms is that protected systems tend to become slower. In particular, software error-detection techniques degrade performance considerably, limiting their uptake.

This paper focuses on software redundant multi-threading error detection, a compiler-based technique that makes use of redundant cores within a multi-core system to perform error checking. Implementations of this scheme feature two threads that execute almost the same code: the main thread runs the original code and the checker thread executes code to verify the correctness of the original. The main thread communicates the values that require checking to the checker thread to use in its comparisons.
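The two-thread arrangement described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `original_computation`, `run_protected`, and `fault_at`, the sentinel protocol, and the use of Python's `queue.Queue` as the inter-thread channel are all assumptions made for the example. As a further simplification, the main thread forwards the input alongside each value so the checker can recompute it.

```python
import queue
import threading

SENTINEL = object()  # marks end of the value stream (hypothetical protocol)


def original_computation(x):
    """Stand-in for the protected program's work (hypothetical)."""
    return x * x + 1


def main_thread(inputs, channel, fault_at=None):
    """Run the original code and forward each value that needs checking."""
    results = []
    for i, x in enumerate(inputs):
        value = original_computation(x)
        if i == fault_at:
            value ^= 1  # simulate a transient bit flip in the main thread
        results.append(value)
        channel.put((x, value))  # one message per checked value
    channel.put(SENTINEL)
    return results


def checker_thread(channel, errors):
    """Recompute each value redundantly and compare with what was received."""
    while True:
        item = channel.get()
        if item is SENTINEL:
            break
        x, received = item
        if original_computation(x) != received:
            errors.append((x, received))


def run_protected(inputs, fault_at=None):
    """Start the checker, run the main code, and return results plus mismatches."""
    channel = queue.Queue()
    errors = []
    checker = threading.Thread(target=checker_thread, args=(channel, errors))
    checker.start()
    results = main_thread(inputs, channel, fault_at)
    checker.join()
    return results, errors
```

Calling `run_protected([1, 2, 3])` reports no mismatches, while `run_protected([1, 2, 3], fault_at=1)` makes the checker flag the corrupted second value. Note that the main thread sends one message per checked value, which is the fine-grained communication pattern the abstract goes on to discuss.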

We identify a major performance bottleneck in existing schemes: poorly performing inter-core communication and the generated code associated with it. Our study shows that this impedes performance because the two threads require extremely fine-grained communication, on the order of every few instructions. We alleviate this bottleneck with a series of code generation optimisations at the compiler level. We propose COMET (Communication-Optimised Multi-threaded Error-detection Technique), which improves performance across the NAS parallel benchmarks by 31.4% (on average) compared to the state of the art, without affecting fault coverage.
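The abstract does not list the specific optimisations, so the sketch below only illustrates why communicating every few instructions is costly. It counts channel operations when values are sent one at a time versus in groups. Batching is an assumption chosen for illustration and is not claimed to be one of COMET's optimisations; `CountingChannel`, `send_per_value`, `send_batched`, and `drain` are hypothetical helpers.

```python
import queue


class CountingChannel:
    """Queue wrapper that counts put operations (hypothetical helper)."""

    def __init__(self):
        self._q = queue.Queue()
        self.puts = 0

    def put(self, item):
        self.puts += 1
        self._q.put(item)

    def get(self):
        return self._q.get()


def send_per_value(values, channel):
    """Fine-grained scheme: one channel operation for every value."""
    for v in values:
        channel.put([v])


def send_batched(values, channel, batch_size):
    """Coarser scheme: one channel operation per group of values."""
    for start in range(0, len(values), batch_size):
        channel.put(values[start:start + batch_size])


def drain(channel, n_messages):
    """Receive n_messages messages and flatten them into one list."""
    received = []
    for _ in range(n_messages):
        received.extend(channel.get())
    return received
```

For ten values, `send_per_value` performs ten channel operations, whereas `send_batched` with `batch_size=4` performs three, and the receiver sees the same values in the same order in both cases. The point is only that the number of communication operations, and the code executed around each one, scales with how finely the stream is divided.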

Index Terms

COMET: communication-optimised multi-threaded error-detection technique
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Reliability
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published in

CASES '16: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems
October 2016
187 pages
ISBN:9781450344821
DOI:10.1145/2968455

Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 October 2016
Author Tags
code generation
communication optimisations
error detection
soft errors
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 52 of 230 submissions, 23%
Article Metrics
- Total Citations: 14
- Total Downloads: 116
- Downloads (Last 12 months): 2
- Downloads (Last 6 weeks): 0