Replicated processors on a single die – How independently do they fail?

Dependencies of faults and failures in multiprocessors on a single chip

  • Original papers
  • e & i Elektrotechnik und Informationstechnik

Abstract

A well-known and efficient fault-tolerance method is the use of multiple components in combination with an output comparator. System-on-chip architectures allow a cost-efficient implementation of this method on a single chip. The resulting proximity of the individual components, however, implies an increased risk of fault coupling, which is why single-chip solutions are more susceptible to common-cause faults (CCFs) than multi-chip solutions. To date, however, it has been unclear to what extent this coupling cancels out the gain in system dependability achieved through replication. In this work the authors analyze potential coupling mechanisms and discuss under which circumstances they lead to an identical output value from all components, since it is exactly in this case that the replication principle fails. Both simulation and experimental investigations are used to derive a quantitative answer to this question. The focus is specifically on thermal effects and on disturbances in the shared power supply. Besides analyzing the relative probability of CCFs, the authors also analyze the effectiveness of countermeasures. They develop a model that decomposes the origin of these CCFs into several steps and show that CCFs require a tight local and temporal coincidence, which is very unlikely for, e.g., thermal effects. A general finding is that even small asymmetries between the components already lead to a drastic reduction in CCFs.

Summary

A very popular and efficient method for achieving fault tolerance is the replication of components paired with a comparison of their outputs. System-on-chip architectures enable a cost-efficient implementation of this scheme on a single die. The resulting close physical proximity of the replicas, however, implies increased coupling, and therefore single-die solutions are more susceptible to common-cause faults (CCFs) than equivalent multi-chip approaches. So far it has been unclear to what degree this coupling reduces the dependability gain that replication achieves in a single-die solution. In this paper we analyze potential coupling mechanisms and study under which circumstances they lead to identical outputs in all replicas, since it is exactly in this case that the "replication and comparison" scheme fails. We perform both simulation studies and comprehensive experimental investigations to derive a quantitative answer to this question. Our particular focus is on thermal effects and on the effects of disturbances in a shared power supply in a duplicated processor architecture. Beyond observing the relative probability of occurrence of CCFs, we also study the effectiveness of several countermeasures against them. We elaborate a model that decomposes the genesis of CCFs into several steps, and show that a very tight local and temporal coincidence of the fault effect in both replicas is crucial for a CCF, which is unlikely, e.g., in the case of thermal effects. As a general result, it turns out that even small asymmetries between the cores yield a drastic reduction in the CCF probability.
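The failure condition of the "replication and comparison" scheme can be made concrete with a minimal sketch. It is not taken from the paper: the names `compare`, `flip_bit`, and `Outcome`, the 4-bit reference value, and the bit-flip fault model are all illustrative assumptions. It only shows that a comparator flags any mismatch between two replicas, and stays silent when both replicas deliver the same wrong value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    detected: bool    # comparator saw differing replica outputs
    silent_ccf: bool  # outputs agree with each other but are both wrong


def compare(expected: int, out_a: int, out_b: int) -> Outcome:
    """Hypothetical output comparator for a duplicated processor.

    The comparator only sees out_a and out_b; `expected` is used here
    solely to classify the outcome for illustration.
    """
    mismatch = out_a != out_b
    silent = (not mismatch) and out_a != expected
    return Outcome(detected=mismatch, silent_ccf=silent)


def flip_bit(value: int, bit: int) -> int:
    """Assumed fault model: a disturbance flips one bit of the output."""
    return value ^ (1 << bit)


golden = 0b1010  # assumed fault-free output of both replicas

# Fault in one replica only: the outputs differ, so it is detected.
single = compare(golden, flip_bit(golden, 0), golden)

# Both replicas disturbed, but differently (a small asymmetry): detected.
asymmetric = compare(golden, flip_bit(golden, 0), flip_bit(golden, 2))

# Both replicas disturbed identically: the comparator cannot see it.
identical = compare(golden, flip_bit(golden, 1), flip_bit(golden, 1))

print(single, asymmetric, identical, sep="\n")
```

Only the last case defeats the comparison, which is why the abstract stresses that the fault effect must coincide tightly in both replicas for a CCF to pass unnoticed.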

About this article

Cite this article

Tummeltshammer, P., Steininger, A. Replicated processors on a single die – How independently do they fail?. Elektrotech. Inftech. 128, 245–250 (2011). https://doi.org/10.1007/s00502-011-0005-9

  • DOI: https://doi.org/10.1007/s00502-011-0005-9
