research-article

CrossTalk: Making Low-Latency Fault Tolerance Cheap by Exploiting Redundant Networks

Published: 09 September 2023

Abstract

Real-time embedded systems perform many important functions in the modern world. A standard way to tolerate faults in these systems is with Byzantine fault-tolerant (BFT) state machine replication (SMR), in which multiple replicas execute the same software and their outputs are compared by the actuators. Unfortunately, traditional BFT SMR protocols are slow, requiring replicas to exchange sensor data back and forth over multiple rounds in order to reach agreement before each execution. The state of the art in reducing the latency of BFT SMR is eager execution, in which replicas execute on data from different sensors simultaneously on different processor cores. However, this technique results in 3–5× higher computation overheads compared to traditional BFT SMR systems, significantly limiting schedulability.

We present CrossTalk, a new BFT SMR protocol that leverages the prevalence of redundant switched networks in embedded systems to reduce latency without added computation. The key idea is to use specific algorithms to move messages between redundant network planes (which many systems already possess) as the messages travel from the sensors to the replicas. As a result, CrossTalk can ensure agreement automatically in the network, avoiding the need for any communication between replicas. Our evaluation shows that CrossTalk improves schedulability by 2.13–4.24× over the state of the art. Moreover, in a NASA simulation of a real spaceflight mission, CrossTalk tolerates more faults than the state of the art while using nearly 3× less processor time.
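The output comparison that the abstract attributes to the actuators can be illustrated with a small voting routine. This is a minimal sketch under stated assumptions, not CrossTalk's algorithm: the function name `vote`, the fault bound `f`, and the f + 1 matching threshold are illustrative choices, and the sketch assumes the non-faulty replicas already executed on identical inputs.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(outputs: Sequence[Hashable], f: int) -> Optional[Hashable]:
    """Hypothetical actuator-side voter (illustrative, not from the paper).

    Accepts a value only if at least f + 1 replicas reported it. With at
    most f faulty replicas, any such value was reported by at least one
    non-faulty replica.
    """
    if not outputs:
        return None
    # Most frequently reported value and how many replicas reported it.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= f + 1 else None


# Three replicas, one of which may be faulty (f = 1).
command = vote([10, 10, 99], f=1)
```

If no value reaches the threshold, the sketch returns `None` and leaves the choice of a safe fallback action to the caller.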

Published in

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 5s
Special Issue ESWEEK 2023
October 2023
1394 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3614235
Editor: Tulika Mitra

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 9 September 2023
• Accepted: 13 July 2023
• Revised: 2 June 2023
• Received: 23 March 2023
