Abstract
Real-time embedded systems perform many important functions in the modern world. A standard way to tolerate faults in these systems is Byzantine fault-tolerant (BFT) state machine replication (SMR), in which multiple replicas execute the same software and the actuators compare their outputs. Unfortunately, traditional BFT SMR protocols are slow: replicas must exchange sensor data over multiple rounds to reach agreement before each execution. The state of the art in reducing the latency of BFT SMR is eager execution, in which replicas execute on data from different sensors simultaneously on different processor cores. However, this technique incurs 3–5× higher computation overhead than traditional BFT SMR systems, significantly limiting schedulability.
We present CrossTalk, a new BFT SMR protocol that leverages the prevalence of redundant switched networks in embedded systems to reduce latency without added computation. The key idea is to use specific algorithms to move messages between redundant network planes (which many systems already possess) as the messages travel from the sensors to the replicas. As a result, CrossTalk can ensure agreement automatically in the network, avoiding the need for any communication between replicas. Our evaluation shows that CrossTalk improves schedulability by 2.13–4.24× over the state of the art. Moreover, in a NASA simulation of a real spaceflight mission, CrossTalk tolerates more faults than the state of the art while using nearly 3× less processor time.
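The abstract notes that, in BFT SMR, replica outputs are compared by the actuators. As a rough illustration of that comparison step only (not of CrossTalk's network algorithms, which the abstract does not detail), the sketch below accepts a command once at least f + 1 replicas report the same value, so that at least one correct replica must have produced it. The function name `vote`, the parameter `f`, and the example values are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(outputs: Sequence[Hashable], f: int) -> Optional[Hashable]:
    """Return the most common value if at least f + 1 replicas reported it, else None.

    Assumption (not from the paper): at most f replicas are faulty and all
    correct replicas executed on the same agreed inputs, so they produce
    identical outputs.
    """
    if not outputs:
        return None
    # most_common(1) yields the single (value, count) pair with the highest count.
    value, count = Counter(outputs).most_common(1)[0]
    # f + 1 matching reports cannot all come from faulty replicas.
    return value if count >= f + 1 else None


# Three replicas, tolerating one fault: two matching outputs suffice.
command = vote(["open_valve", "open_valve", "close_valve"], f=1)
```

Under these assumptions, the actuator only needs the replicas to have agreed on their inputs beforehand; obtaining that agreement is the step whose latency the abstract is concerned with.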
Index Terms
- CrossTalk: Making Low-Latency Fault Tolerance Cheap by Exploiting Redundant Networks