Analysis of Relationship Between Modes of Intercomputer Communications and Fault Types in Redundant Computer Systems

Samet, Refik; Samet, Nermin

doi:10.1007/978-3-662-54563-8_1


Chapter
First Online: 12 March 2017


Part of the book series: Lecture Notes in Computer Science (TCOMPUTATSCIE, volume 10220)

Abstract

This paper analyzes the reasons for the appearance of Non-Byzantine and Byzantine fault types in redundant computer systems. The proposed approach is based on an analysis of the relationship between the modes of intercomputer communications and fault types. This analysis allows designers to build redundant computer systems in such a way that Byzantine faults cannot appear. Designing redundant computer systems in which Byzantine faults cannot appear allows the designers to increase the degree of reliability by preventing the masking of any form of fault appearance and by decreasing the time period of checkpoints. In addition, this approach decreases the cost of the software and hardware involved in the execution of fault-tolerant procedures.



Author information

Authors and Affiliations

Ankara University, Ankara, Turkey
Refik Samet
Middle East Technical University, Ankara, Turkey
Nermin Samet


Corresponding author

Correspondence to Refik Samet.

Editor information

Editors and Affiliations

University of Calgary, Calgary, Alberta, Canada
Marina L. Gavrilova
Sardina Systems, Tallinn, Estonia
C.J. Kenneth Tan

Appendix I: Examples

Example 1

Let us consider a Redundant Computer System (RCS) which consists of four computers (nodes) (Fig. 1 with \(N=4\)). Suppose that the fourth node is faulty and the computational results of the non-faulty nodes are “1”. If the fourth node sends the same value, namely logical “0”, to all other nodes during the exchange of computational results, the fault type is Non-Byzantine.

In this case, the vectors formed in all non-faulty nodes consist of the same values, namely \((1, 1, 1, 0)\). However, if the fourth node sends different values, namely logical “0” to the first and third nodes and logical “1” to the second node, the fault type is Byzantine.

In this case, the vectors formed in the non-faulty nodes differ: the first and third nodes form \((1, 1, 1, 0)\), while the second node forms \((1, 1, 1, 1)\). Here, x (“0” or “1”) denotes the value held in the faulty node.
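The distinction drawn in Example 1 can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the names `classify_fault` and `first_round_vectors` are hypothetical and do not come from the chapter, and a fault is recognized purely by comparing the values that one node sent to each receiver.

```python
from typing import Dict, List


def classify_fault(sent: Dict[int, int], correct: int) -> str:
    """Classify one node's behavior from the value it sent to each receiver."""
    values = set(sent.values())
    if len(values) > 1:
        return "Byzantine"       # receivers were told different values
    if values == {correct}:
        return "no fault"        # everyone received the correct value
    return "Non-Byzantine"       # everyone received the same wrong value


def first_round_vectors(own: List[int], faulty: int,
                        sent: Dict[int, int]) -> Dict[int, List[int]]:
    """Vectors formed in the receiving nodes after one exchange.

    own[n - 1] is the result of node n; sent[r] is the value the faulty
    node delivered to receiver r (hypothetical data layout).
    """
    vectors = {}
    for receiver, value in sent.items():
        vector = list(own)
        vector[faulty - 1] = value   # entry for the faulty node as seen by r
        vectors[receiver] = vector
    return vectors


own_results = [1, 1, 1, 1]                 # entry 4 is overwritten per receiver
same_to_all = {1: 0, 2: 0, 3: 0}           # node 4 sends "0" to everyone
different = {1: 0, 2: 1, 3: 0}             # node 4 sends "0", "1", "0"

print(classify_fault(same_to_all, correct=1))
print(first_round_vectors(own_results, 4, same_to_all))
print(classify_fault(different, correct=1))
print(first_round_vectors(own_results, 4, different))
```

In the first scenario all three non-faulty nodes end up with identical vectors; in the second, the vector of the second node differs from those of the first and third nodes.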

Example 2

To explain the masking of fault forms, the Determinate Byzantine agreement protocol will be used. Suppose that \(N=4\) and \(k=1\); consequently, \(m=2\) (Table 3). Let us consider three cases: (1) there is no fault in RCS; (2) a Byzantine fault appears in RCS during the first intercomputer communication round; and (3) a Byzantine fault appears in RCS during the second intercomputer communication round.

Example 2.1. First, let us consider the case when there is no fault in RCS. Suppose that computational results of nodes are the logical “1”.

In the first intercomputer communication round, the nodes exchange their computational results. After the first round, every node therefore forms the vector \((1, 1, 1, 1)\).

In the second intercomputer communication round, the \(i^{th}\) node transmits to the \(j^{th}\) node the computational result of the \(n^{th}\) node, where \(i, j, n = 1, 2, 3, 4\) and \(i \ne j \ne n\) (all three indices are distinct). After the second round, the matrices of the computational results are formed.

After majority voting in columns, the final vectors are formed. Element \(a_{nn}\) is chosen as the majority of the values in the \(n^{th}\) column, where \(n = 1, 2, 3, 4\).

In this case, all matrices consist of the same computational results and all nodes form the same final vectors. This means that there is no fault in RCS.

Example 2.2. Now, let us consider the case when a Byzantine fault appears in the first intercomputer communication round. Suppose that the fourth node is faulty and the computational results of the nodes are the logical “1”, “1”, “1”, “x” (“0” or “1”).

Suppose that the faulty node sends the logical “0” to the first and third nodes and the logical “1” to the second node during the first intercomputer communication round. After the first round, the first and third nodes form the vector \((1, 1, 1, 0)\) and the second node forms the vector \((1, 1, 1, 1)\).

After the second round, the matrices are formed and, after majority voting in columns, the final vectors are formed.

In this case, the matrices in the non-faulty nodes consist of different computational results. However, all non-faulty nodes form the same final vector, in which the fourth computational result differs from the others. So the non-faulty nodes determine that the fourth node is faulty. Thus the Determinate Byzantine agreement algorithm transforms a Byzantine fault appearing in the first intercomputer communication round into the Non-Byzantine type.

Example 2.3. Finally, let us consider the case when a Byzantine fault appears in the second intercomputer communication round. Suppose that the computational results of all nodes are the logical “1”.

After the first round, every node forms the vector \((1, 1, 1, 1)\).

Suppose that a Byzantine fault appears in the fourth node during the second intercomputer communication round. The faulty node will send different computational results to the non-faulty nodes. After the second round, the matrices are formed.

After majority voting in columns, the final vectors are formed.

In this case, the matrices in the non-faulty nodes consist of different computational results. Despite this, all non-faulty nodes form the same final vector, in which all computational results are the same. The non-faulty nodes therefore determine that there is no fault in RCS. This means that the Determinate Byzantine agreement algorithm masks a Byzantine fault that appears in the second round. The same situation takes place when a Non-Byzantine fault appears in RCS during the second intercomputer communication round. As a result, the Byzantine agreement algorithms mask the Non-Byzantine and Byzantine fault appearance forms that occur during the second and following rounds. Masking of faults is very dangerous because masked faults can accumulate and lead the system to failure.
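The three cases of Example 2 can be replayed with a small simulation. The sketch below is an assumption-laden model, not the chapter's own matrices: for each column \(n\) a node votes over the value it received directly from node \(n\) plus the values relayed by the other nodes, and the function names and the `round1_lie` / `round2_lie` hooks are hypothetical helpers used only to inject the fault.

```python
from collections import Counter


def majority(values):
    """Return the strict-majority value of a list, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


def run_two_rounds(results, round1_lie=None, round2_lie=None):
    """Final vector of every node after two communication rounds.

    round1_lie(sender, receiver) -> value or None: overrides what `sender`
    tells `receiver` about its own result in round 1.
    round2_lie(relay, receiver, about) -> value or None: overrides what
    `relay` tells `receiver` about node `about` in round 2.
    """
    nodes = sorted(results)
    # Round 1: direct[j][n] is what node j heard from node n.
    direct = {j: {} for j in nodes}
    for j in nodes:
        for n in nodes:
            lie = round1_lie(n, j) if (round1_lie and n != j) else None
            direct[j][n] = results[n] if lie is None else lie
    # Round 2 and voting: column n at node j holds the direct value
    # plus the values relayed by every node i other than j and n.
    final = {}
    for j in nodes:
        vector = []
        for n in nodes:
            if n == j:
                vector.append(results[j])
                continue
            column = [direct[j][n]]
            for i in nodes:
                if i in (j, n):
                    continue
                lie = round2_lie(i, j, n) if round2_lie else None
                column.append(direct[i][n] if lie is None else lie)
            vector.append(majority(column))
        final[j] = vector
    return final


results = {1: 1, 2: 1, 3: 1, 4: 1}

# Example 2.1: no fault.
no_fault = run_two_rounds(results)

# Example 2.2: node 4 sends "0" to nodes 1 and 3 and "1" to node 2 in round 1.
round1_fault = run_two_rounds(
    results,
    round1_lie=lambda sender, receiver:
        {1: 0, 2: 1, 3: 0}[receiver] if sender == 4 else None,
)

# Example 2.3: node 4 is correct in round 1 but relays receiver-dependent
# values in round 2 (the concrete pattern is an arbitrary choice).
round2_fault = run_two_rounds(
    results,
    round2_lie=lambda relay, receiver, about:
        receiver % 2 if relay == 4 else None,
)

for name, final in [("2.1", no_fault), ("2.2", round1_fault), ("2.3", round2_fault)]:
    print(name, {j: final[j] for j in (1, 2, 3)})
```

Under these assumptions the non-faulty nodes agree in all three cases, but only in case 2.2 does the agreed vector expose the fourth node; in case 2.3 the agreed vector is indistinguishable from the fault-free one, which is the masking effect described above.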

Example 3

Let us show that the use of a Byzantine agreement algorithm (for example, the Determinate Byzantine agreement algorithm) cannot counteract a Byzantine fault in RCS with \(N=3\) (Fig. 1 with \(N=3\)). Suppose that the third node is faulty and the computational results of the nodes are “1”, “1”, “x” (“0” or “1”).

Suppose that the faulty node sends the logical “0” to the first node and the logical “1” to the second node during the first intercomputer communication round. After the first round, the first node forms the vector \((1, 1, 0)\) and the second node forms the vector \((1, 1, 1)\).

After the second round, the matrices are formed.

After majority voting in columns, the final vectors are formed. The non-faulty nodes form different final vectors and cannot determine the faulty node. So the Byzantine agreement algorithms cannot detect a Byzantine fault in RCS with \(N=3\).
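A short sketch shows why voting breaks down for \(N=3\). It assumes that each non-faulty node votes over the value it received directly from the third node and the value relayed by the other non-faulty node, so every column has only two entries. The fallback to the directly received value when there is no strict majority is an assumption made here for illustration; it is not taken from the chapter.

```python
from collections import Counter


def strict_majority(values):
    """Return the value held by more than half of the entries, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


# Round 1: faulty node 3 tells node 1 "0" and node 2 "1".
heard_from_3 = {1: 0, 2: 1}

# Round 2: each non-faulty node relays what it heard from node 3.
# Column for node 3 at each non-faulty node: [direct value, relayed value].
column_3 = {
    1: [heard_from_3[1], heard_from_3[2]],
    2: [heard_from_3[2], heard_from_3[1]],
}

# Two conflicting entries never produce a strict majority.
votes = {node: strict_majority(col) for node, col in column_3.items()}

# Assumed tie-break: keep the directly received value.
fallback = {node: col[0] for node, col in column_3.items()}

print(votes)
print(fallback)
```

With only two entries per column, neither non-faulty node obtains a strict majority, and under the assumed tie-break the two nodes settle on different values for the third node.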

Example 4

Let us consider RCS where Non-Byzantine and Byzantine fault types may occur (Fig. 12).

Suppose that the \(n^{th}\) (\(n=1, 2, 3, 4\)) computer (node) in RCS consists of Central Processor (CP), Input Processor (IP) and Output Processor (OP) as shown in Fig. 12.

Each CP controls its own computational process, computes its own computational result and executes its own fault-tolerant procedure on the basis of IDS which consists of computational results of all nodes in RCS. Each IP consists of 4 receivers (R1, R2, R3, and R4) which receive the computational results from the other nodes. Each OP consists of 4 transmitters (T1, T2, T3, and T4) which transmit the computational results to the other nodes.

On the one hand, in this protocol, OP of the \(n^{th}\) node (\(n=1, 2, 3, 4\)) transmits the computational results to IP of the \(i^{th}\) nodes (\(i=1, 2, 3, 4\) and \(n \ne i\)) in parallel by using three busses. The nodes execute this procedure in sequential order, using the three busses in the time-sharing mode. For example:

- T1, T2 and T3 of OP of the \(1^{st}\) node transmit the computational result at the instant of time \(t_1\) to the other nodes by using three busses in parallel;
- T1, T2 and T3 of OP of the \(2^{nd}\) node transmit the computational result at the instant of time \(t_2\) to the other nodes by using three busses in parallel;
- T1, T2 and T3 of OP of the \(3^{rd}\) node transmit the computational result at the instant of time \(t_3\) to the other nodes by using three busses in parallel;
- T1, T2 and T3 of OP of the \(4^{th}\) node transmit the computational result at the instant of time \(t_4\) to the other nodes by using three busses in parallel.

On the other hand, in this protocol, IP of the \(n^{th}\) node (\(n=1, 2, 3, 4\)) receives the computational results from OP of the \(i^{th}\) nodes (\(i=1, 2, 3, 4\) and \(n \ne i\)) in the time-sharing mode by using one bus. All nodes execute this procedure in parallel. For example:

- R1 of IP of the \(1^{st}\) node receives the computational results from T1 of the \(2^{nd}\), \(3^{rd}\) and \(4^{th}\) nodes at the instants of time \(t_2\), \(t_3\) and \(t_4\), respectively, by using one bus;
- R1 of IP of the \(2^{nd}\) node receives the computational results from T1 of the \(1^{st}\) node and from T2 of the \(3^{rd}\) and \(4^{th}\) nodes at the instants of time \(t_1\), \(t_3\) and \(t_4\), respectively, by using one bus;
- R1 of IP of the \(3^{rd}\) node receives the computational results from T2 of the \(1^{st}\) and \(2^{nd}\) nodes and from T3 of the \(4^{th}\) node at the instants of time \(t_1\), \(t_2\) and \(t_4\), respectively, by using one bus;
- R1 of IP of the \(4^{th}\) node receives the computational results from T3 of the \(1^{st}\), \(2^{nd}\) and \(3^{rd}\) nodes at the instants of time \(t_1\), \(t_2\) and \(t_3\), respectively, by using one bus.
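The time-sharing schedule can be written down programmatically. The following is a minimal sketch that assumes node \(n\) transmits in slot \(t_n\), so each receiver hears every other node in that node's own slot; the function names are hypothetical.

```python
def transmit_slots(n_nodes):
    """Node n transmits in slot t_n (slots are numbered from 1)."""
    return {node: node for node in range(1, n_nodes + 1)}


def receive_schedule(n_nodes):
    """For each receiving node, list (sender, slot) pairs in slot order."""
    slots = transmit_slots(n_nodes)
    return {
        receiver: [
            (sender, slots[sender])
            for sender in range(1, n_nodes + 1)
            if sender != receiver
        ]
        for receiver in range(1, n_nodes + 1)
    }


for receiver, pairs in receive_schedule(4).items():
    listing = ", ".join(f"node {s} at t_{t}" for s, t in pairs)
    print(f"node {receiver} receives from: {listing}")
```

Each node is silent in its own slot and listens in the remaining \(N-1\) slots, which is why one shared receiving bus per node is sufficient in this structure.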

We assume that only one fault may occur in RCS at any instant of time (Sect. 2.4). According to this assumption, three cases may take place:

(1) Fault occurs in IP (CP and OP are non-faulty) of the \(n^{th}\) node. The functions of IP are to receive the computational results from the other nodes and to save them in its buffer memory. According to our assumption, if IP is faulty, CP and OP are non-faulty in the \(n^{th}\) (\(n=1, 2, 3, 4\)) node.
- Faulty IP may change the correct computational results received from the \(i^{th}\) (\(i=1, 2, 3, 4\) and \(n \ne i\)) nodes in the first and following communication rounds and save correct or incorrect computational results in its buffer memory.
- Non-faulty CP computes its own correct computational result.
- Non-faulty OP transmits the same computational result to all other nodes in every round: in the first communication round, the correct result computed by non-faulty CP (which does not pass through faulty IP); in the second and following communication rounds, the results received by faulty IP, which may be correct or incorrect.
In this case, the transmitters of non-faulty OP transmit the same computational results (whether correct or incorrect) to the other nodes during all communication rounds. Consequently, if a fault occurs in IP, the type of faults appearing in the above RCS structure is only Non-Byzantine.

(2) Fault occurs in CP (IP and OP are non-faulty) of the \(n^{th}\) node. The functions of CP are to control its own computational process, to compute its own computational result and to execute its own fault-tolerant procedure on the basis of IDS, which consists of the computational results of all nodes in RCS. According to our assumption, if CP is faulty, IP and OP are non-faulty in the \(n^{th}\) node.
- Non-faulty IP receives the same (correct) computational results from the other nodes in the first and following communication rounds.
- Faulty CP may compute its own incorrect computational result.
- Non-faulty OP transmits the same computational result to all other nodes in every round: in the first communication round, the (possibly incorrect) result computed by faulty CP of the \(n^{th}\) node; in the second and following communication rounds, the correct results received by non-faulty IP.
In this case, the transmitters also transmit the same computational results (whether correct or incorrect) to the other nodes during all communication rounds. Consequently, if a fault occurs in CP, the type of faults appearing in the above RCS structure is also only Non-Byzantine.

(3) Fault occurs in OP (CP and IP are non-faulty) of the \(n^{th}\) node. The main function of OP is to transmit the computational results to the other nodes. According to our assumption, if OP is faulty, IP and CP are non-faulty in the \(n^{th}\) node.
- Non-faulty IP receives the correct computational results from the other nodes in the first and following communication rounds.
- Non-faulty CP computes its own correct computational result.
- Faulty OP might transmit different computational results to the other nodes during all communication rounds, because OP has multiple transmitters (three of them are used here) and one or more of them might be faulty and change the transmitted values.
Consequently, if a fault occurs in OP, the type of faults appearing in the above RCS structure is Non-Byzantine or Byzantine.

Let us change the connections between the nodes of RCS in Fig. 12 so that the Byzantine fault type cannot occur (Fig. 13).

On the one hand, in this protocol, only one transmitter of OP of the \(n^{th}\) node (\(n=1, 2, 3, 4\)) transmits the computational result to IP of all other nodes simultaneously, in the broadcast mode, by using one bus. The nodes execute this procedure in sequence. For example:

- T1 of OP of the \(1^{st}\) node transmits the computational result at the instant of time \(t_1\) to the other nodes in the broadcast mode by using one bus simultaneously;
- T1 of OP of the \(2^{nd}\) node transmits the computational result at the instant of time \(t_2\) to the other nodes in the broadcast mode by using one bus simultaneously;
- T1 of OP of the \(3^{rd}\) node transmits the computational result at the instant of time \(t_3\) to the other nodes in the broadcast mode by using one bus simultaneously;
- T1 of OP of the \(4^{th}\) node transmits the computational result at the instant of time \(t_4\) to the other nodes in the broadcast mode by using one bus simultaneously.

On the other hand, IP of the \(n^{th}\) node (\(n=1, 2, 3, 4\)) receives the computational results from OP of the \(i^{th}\) nodes (\(i=1, 2, 3, 4\) and \(n \ne i\)) by using \(N-1\) busses in sequential order. All nodes execute this procedure in parallel. For example:

- R1, R2 and R3 of IP of the \(1^{st}\) node receive the computational results from the \(2^{nd}\), \(3^{rd}\) and \(4^{th}\) nodes at the instants of time \(t_2\), \(t_3\) and \(t_4\), respectively, by using three different busses;
- R1, R2 and R3 of IP of the \(2^{nd}\) node receive the computational results from the \(1^{st}\), \(3^{rd}\) and \(4^{th}\) nodes at the instants of time \(t_1\), \(t_3\) and \(t_4\), respectively, by using three different busses;
- R1, R2 and R3 of IP of the \(3^{rd}\) node receive the computational results from the \(1^{st}\), \(2^{nd}\) and \(4^{th}\) nodes at the instants of time \(t_1\), \(t_2\) and \(t_4\), respectively, by using three different busses;
- R1, R2 and R3 of IP of the \(4^{th}\) node receive the computational results from the \(1^{st}\), \(2^{nd}\) and \(3^{rd}\) nodes at the instants of time \(t_1\), \(t_2\) and \(t_3\), respectively, by using three different busses.

According to the assumption, three cases are also possible here. The discussion of the first and second cases is the same as for the previous protocol. The difference is in the third case, in which the single transmitter of faulty OP transmits the same computational result (whether correct or incorrect) to all other nodes because the broadcast mode is used.

Consequently, if a fault occurs in OP, the type of faults appearing in the RCS structure will only be Non-Byzantine.

As a result, by changing the connections between the nodes we obtained RCS in which only the Non-Byzantine fault type can occur. Consequently, by changing the connection modes between nodes, we can block the occurrence of the Byzantine fault type in RCS.
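The effect of the connection mode on a faulty OP can be illustrated with a toy model. The sketch below rests on several assumptions that are not in the chapter: a faulty transmitter is modelled as inverting the transmitted bit, the busses and receivers are taken to be fault-free, and the mode names and function names are hypothetical.

```python
def values_seen(result, receivers, mode, faulty_transmitters=frozenset()):
    """Value each receiver obtains from one node's Output Processor.

    mode "point-to-point": the k-th receiver is served by its own
    transmitter T_k (structure with several transmitters in use).
    mode "broadcast": transmitter T1 alone drives one bus heard by all.
    A faulty transmitter is modelled as inverting the bit (assumption).
    """
    seen = {}
    for index, receiver in enumerate(receivers, start=1):
        transmitter = index if mode == "point-to-point" else 1
        corrupted = transmitter in faulty_transmitters
        seen[receiver] = 1 - result if corrupted else result
    return seen


def fault_type(seen, correct):
    """Classify the fault from the values the receivers obtained."""
    values = set(seen.values())
    if len(values) > 1:
        return "Byzantine"
    return "no fault" if values == {correct} else "Non-Byzantine"


receivers = [2, 3, 4]          # node 1 transmits the result "1"
p2p = values_seen(1, receivers, "point-to-point", faulty_transmitters={1})
bcast = values_seen(1, receivers, "broadcast", faulty_transmitters={1})

print(p2p, fault_type(p2p, correct=1))
print(bcast, fault_type(bcast, correct=1))
```

With the same faulty transmitter T1, the point-to-point structure delivers inconsistent values to the receivers, whereas the broadcast structure delivers one (wrong) value to all of them, so the fault can only appear as Non-Byzantine.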


Samet, R., Samet, N. (2017). Analysis of Relationship Between Modes of Intercomputer Communications and Fault Types in Redundant Computer Systems. In: Gavrilova, M., Tan, C. (eds) Transactions on Computational Science XXIX. Lecture Notes in Computer Science, vol 10220. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-54563-8_1


DOI: https://doi.org/10.1007/978-3-662-54563-8_1
Published: 12 March 2017
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-54562-1
Online ISBN: 978-3-662-54563-8
eBook Packages: Computer Science, Computer Science (R0)
