Abstract
Replication is a standard technique for fault tolerance in distributed systems modeled as deterministic finite state machines (DFSMs or machines). To correct $f$ crash faults among $n$ machines, replication requires $nf$ additional backup machines. We present a fusion-based solution that requires just $f$ additional backup machines (called fusions or fused backups). In this paper, we first propose a fundamental problem regarding DFSMs, independent of fault tolerance, that has not been explored in the literature so far: given a machine $M$ with a set of states and a set of events, can we replace it with machines that each contain fewer events than $M$? To formalize this, we define a $(k,e)$-event decomposition of a given machine $M$: a set of $k$ machines, each with at least $e$ fewer events than the event set of $M$, that, acting in parallel, are equivalent to $M$. We present an algorithm to generate such machines with time complexity $O(|X_M|^3 |\Sigma_M|^e)$, where $X_M$ is the set of states and $\Sigma_M$ the set of events of $M$. Second, we use our event decomposition algorithm to generate fused backups that can correct faults among a given set of machines. We show that these backups are minimal with respect to the number of states they contain and the number of events in their event set. Third, we use the notion of locality-sensitive hashing to present algorithms for the detection and correction of faults for the fusion-based solution. The algorithm for the detection of Byzantine faults has time complexity $O(nf)$ on average, which is the same as that for replication. The algorithm for the correction of both crash and Byzantine faults has time complexity $O(n \rho f)$ with high probability (w.h.p.), where $\rho$ is the average state reduction achieved by fusion. We show that for small values of $n$ (for most practical systems, $n < 10$) and $\rho$ (the average value of $\rho$ was less than 2 in our experiments), this results in almost no overhead compared to replication.
Finally, we evaluate fusion on the widely used MCNC’91 benchmarks for DFSMs; the results show that the average state-space savings of fusion over replication are 38% (range 0-99%), while the average event reduction is 4% (range 0-45%).
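To make the notion of a (k,e)-event decomposition concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes one particular reading of "equivalent": the joint state of the smaller machines, each ignoring events outside its own event set, must determine the state of M after every event sequence. The class `DFSM`, the helper `is_event_decomposition`, the bound `max_len`, and the toy parity machines are all hypothetical names and examples introduced for illustration.

```python
from itertools import product


class DFSM:
    """Deterministic finite state machine that ignores events outside its event set."""

    def __init__(self, events, transitions, initial):
        self.events = frozenset(events)
        self.transitions = transitions  # maps (state, event) -> next state
        self.initial = initial

    def run(self, sequence):
        state = self.initial
        for event in sequence:
            if event in self.events:  # events not in the event set are skipped
                state = self.transitions[(state, event)]
        return state


# Toy machine M: tracks the parity of 'a' events and of 'b' events (4 states, 2 events).
M = DFSM(
    events={"a", "b"},
    transitions={
        ((pa, pb), ev): ((pa ^ 1, pb) if ev == "a" else (pa, pb ^ 1))
        for pa, pb, ev in product((0, 1), (0, 1), "ab")
    },
    initial=(0, 0),
)

# Candidate parts: each has one event fewer than M, so here k = 2 and e = 1.
A = DFSM({"a"}, {(0, "a"): 1, (1, "a"): 0}, 0)
B = DFSM({"b"}, {(0, "b"): 1, (1, "b"): 0}, 0)


def is_event_decomposition(machine, parts, max_len=6):
    """Bounded brute-force check (assumed equivalence notion, see lead-in).

    Returns False if two event sequences of length <= max_len leave the parts
    in the same joint state while leaving `machine` in different states.
    """
    seen = {}  # joint state of the parts -> state of `machine`
    for length in range(max_len + 1):
        for seq in product(sorted(machine.events), repeat=length):
            joint = tuple(p.run(seq) for p in parts)
            target = machine.run(seq)
            if seen.setdefault(joint, target) != target:
                return False
    return True


print(is_event_decomposition(M, [A, B]))  # both parities together recover M's state
print(is_event_decomposition(M, [A]))     # A alone cannot see 'b' events
```

The check is exhaustive only up to `max_len`, so it can refute a candidate decomposition but does not prove one in general; for this toy machine the bound comfortably covers every reachable state.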
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Balasubramanian, B., Garg, V.K. (2011). Fused State Machines for Fault Tolerance in Distributed Systems. In: Fernàndez Anta, A., Lipari, G., Roy, M. (eds) Principles of Distributed Systems. OPODIS 2011. Lecture Notes in Computer Science, vol 7109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25873-2_19
DOI: https://doi.org/10.1007/978-3-642-25873-2_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25872-5
Online ISBN: 978-3-642-25873-2
eBook Packages: Computer Science, Computer Science (R0)