Zoro: A robotic middleware combining high performance and high reliability

https://doi.org/10.1016/j.jpdc.2022.04.010

Highlights

  • We propose a shared memory pool architecture for high performance transportation.

  • We propose a socket based control algorithm to improve communication reliability.

  • We propose a hierarchical memory protection to improve communication safety.

  • We propose a novel high efficiency and high reliability service discovery method.

  • We implement Zoro in the middleware layer of ROS2 to validate its efficiency.

Abstract

With the significant advances in AI technology, robotic systems have achieved remarkable development and profound effects. Robotic middleware plays a critical role in robotic systems by providing message definition, data transmission, and service discovery functions. Because modern robotic systems usually operate in safety-critical environments, such as autonomous driving scenarios, robotic middleware must combine high performance and high reliability to achieve system reliability and product safety. However, the conventional robotic middleware used in the majority of robotic systems is based on an inefficient socket communication mechanism and relies on a hasty service discovery design, which leads to system instability and high resource usage. In this work, we propose a sophisticated robotic middleware, Zoro, to fulfill both high performance and high reliability. For communication, we employ shared memory for performance improvement and propose a socket-based communication control algorithm to improve reliability during data transmission. In addition, we propose a hierarchical memory protection mechanism to address the safety problems caused by shared memory. Furthermore, we design a lightweight service discovery to achieve high performance and a weakly centralized mechanism for high reliability. Experiments show that Zoro outperforms state-of-the-art robotic middleware such as ROS2 and CyberRT in communication latency by up to 41%. In terms of service discovery, Zoro reduces CPU usage by up to 44% compared to ROS. Zoro also achieves reliability in both communication and service discovery.

Introduction

Benefiting from rising AI technology [20], [26], intelligent autonomous systems and their applications are growing at a rapid pace [33]. In particular, both performance and reliability are critical characteristics of autonomous navigation systems, as they generate a large amount of data from their surrounding environment at runtime. In addition, they require the underlying communication middleware to transmit and process data in real time without any system failure or data loss. As a core infrastructure, middleware provides message definition, data transmission, and service discovery functions. Message definition provides the pre-defined message types valid in the robotic system, which modules exchange through the data transmission function. To allow modules with the same topic to communicate with each other, service discovery notifies modules when other modules start. We note that conventional middleware cannot provide both high performance and high reliability in its data transmission and service discovery functions.

Modules in a robotic system are usually organized at a coarse granularity as processes. In such a design, with independent functions and isolated resources, a process crash in one module generally should not affect other modules or the entire system. Thus, data transmission between modules usually relies on an Inter-Process Communication (IPC) mechanism. Taking typical autonomous navigation as an example, a planning process receives messages from the perception and localization processes and sends planning results to other processes such as the vehicle control module.

To provide message passing support for autonomous navigation applications, robotic communication middleware [10], such as ROS [31], ROS2 [2], [23] and CyberRT [3], has been widely used due to its high compatibility and extensibility. Most robotic communication middleware employs socket-based IPC methods [7] for their reliability, since they provide a loosely coupled communication model in which the sender (publisher) and receiver (subscriber) are decoupled and connected through socket communication. However, such a socket communication mechanism is unsatisfactory when large messages are transmitted to multiple receivers (subscribers), where communication latency increases linearly with message size [23], [21].

Therefore, shared memory based IPC methods have been proposed to improve performance. Shared memory allows two or more processes to share a given region of memory, which is the most efficient way of IPC, especially with multiple receivers, because the data do not need to be copied. Furthermore, memory throughput can be greatly reduced because fewer memory copies are made.
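As a rough illustration of why shared memory avoids copying between a sender and its receivers, the following Python sketch uses the standard `multiprocessing.shared_memory` module. It is not Zoro's implementation: the function name `share_without_copy` is hypothetical, and both "sides" run in one process only to keep the example short. In a real system the second handle would be opened by a different process that knows the region name.

```python
from multiprocessing import shared_memory


def share_without_copy(payload: bytes) -> bytes:
    """Write a non-empty payload once and read it back through a second handle."""
    # "Publisher" side: create a named region and write the payload into it.
    region = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        region.buf[:len(payload)] = payload
        # "Subscriber" side: attach to the same region by name.
        # Only the name is exchanged; the payload itself is not transmitted.
        view = shared_memory.SharedMemory(name=region.name)
        try:
            # bytes(...) copies here only so the result outlives the mapping.
            return bytes(view.buf[:len(payload)])
        finally:
            view.close()
    finally:
        region.close()
        region.unlink()  # release the region once every handle is closed
```

With several subscribers, each one would attach to the same region, so the payload is written once regardless of how many readers there are.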

Although shared memory based methods provide high performance and fewer memory accesses, they often lead to system safety problems. First, to avoid race conditions when different processes read and write the same data in shared memory, read-write locks are necessary. If a process crashes while holding the read-write lock, the shared memory may be destroyed unexpectedly, and other processes that need to read or write the shared data are blocked because they cannot acquire the lock. This tightly coupled communication model causes the Crash Safety Issue. Second, shared memory based methods typically adopt a ring buffer to manage data for communication. The sender (publisher) writes data into the ring buffer in turn, and outdated data are replaced once new data are written. Therefore, this model cannot guarantee that all receivers (subscribers) successfully read the data before they are replaced. This leads to the Data Reliability Issue, especially for processes that require highly reliable data reception at a stable frequency. Third, as an efficient IPC method, shared memory allows peer processes to directly access and modify the shared memory buffer, leading to potential invalid memory access and data corruption. We define this problem as the Memory Protection Issue. These issues hinder robotic systems from achieving high reliability.
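The Data Reliability Issue can be made concrete with a small sketch. The `RingBuffer` class and `count_missed` helper below are hypothetical names introduced for illustration; they model an in-process ring buffer rather than any specific middleware, and assume the worst case in which the subscriber reads nothing until the publisher has finished writing.

```python
class RingBuffer:
    """Fixed-capacity ring buffer in which new writes overwrite the oldest slot."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.slots = [None] * capacity
        self.next_seq = 0  # sequence number of the next message to be written

    def write(self, message) -> int:
        seq = self.next_seq
        # The slot is reused every `capacity` writes, replacing older data.
        self.slots[seq % self.capacity] = (seq, message)
        self.next_seq += 1
        return seq

    def read(self, seq: int):
        """Return message `seq`, or None if it was overwritten or never written."""
        entry = self.slots[seq % self.capacity]
        if entry is None or entry[0] != seq:
            return None
        return entry[1]


def count_missed(capacity: int, published: int) -> int:
    """Publish `published` messages before a slow subscriber reads any of them."""
    ring = RingBuffer(capacity)
    for i in range(published):
        ring.write(f"frame-{i}")
    return sum(1 for seq in range(published) if ring.read(seq) is None)
```

In this model, a buffer of 4 slots that receives 10 messages before the subscriber wakes up retains only the last 4, so `count_missed(4, 10)` reports 6 lost messages.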

In addition to communication, service discovery is another crucial aspect of achieving high performance and high reliability. Service discovery allows modules to update their states and maintains the data-flow graph between modules. Although service discovery uses notifications to make modules with the same topic aware of each other, traditional middleware lacks either high performance or high reliability. For example, ROS implements service discovery with a centralized organization. This centralized method simplifies module notification and improves resource usage, but it makes the system unreliable, because a single crash of the service discovery crashes the whole system. In comparison, ROS2 improves system reliability by switching from centralized notification to a decentralized approach. With the decentralized approach, however, resource usage rises sharply as the number of modules increases, because every module broadcasts its status to all other modules. This peer-to-peer notification significantly increases communication traffic and limits its applicability to robotic systems (such as autonomous systems) consisting of a large number of modules. Thus, for service discovery, both efficiency and reliability should be taken into account.
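The scaling difference between the two approaches can be approximated with a simple counting model. The two functions below are hypothetical and deliberately simplified: they count only the status messages sent in one update round and ignore replies, retries, and any protocol-specific traffic in ROS or ROS2.

```python
def centralized_notifications(modules: int) -> int:
    """Each module reports its status once to a single discovery service."""
    return modules


def decentralized_notifications(modules: int) -> int:
    """Each module broadcasts its status to every other module."""
    return modules * (modules - 1)
```

Under this model, 100 modules produce 100 messages per round with a central service and 9900 with peer-to-peer broadcast, which is the linear versus quadratic growth the paragraph above describes.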

We summarize the performance and reliability of communication and service discovery for different robotic middleware. As shown in Table 1, conventional robotic middleware lacks performance or reliability in either communication or service discovery. To the best of our knowledge, none of the conventional middleware provides both high performance, namely low communication latency, and high reliability, which covers a low data loss rate during communication and low interference between processes when some of them crash.

In this paper, we design a framework that leverages shared memory and a reliable notification mechanism to provide high-performance and highly reliable data communication. The middleware also exploits lightweight notification and a decoupled design to reduce resource usage. This work makes four contributions.

  • We study performance and system safety problems in conventional robotic middleware, and propose a shared memory pool architecture for high performance transportation.

  • We propose a socket based control algorithm and hierarchical memory access protection to improve the reliability and safety for data communication.

  • We propose a novel service discovery, which provides both high efficiency and high reliability.

  • Based on our proposed techniques, we implement a robotic middleware, Zoro, in the middleware layer of ROS2. The evaluation results in a realistic workflow show that (1) Zoro is able to gain up to 41% performance improvement on average over its counterparts, and memory throughput is reduced from 17 GB/s to 5 GB/s in a typical workflow. (2) Zoro achieves crash safety and reduces the data missing rate by 5.2% compared with CyberRT. (3) Zoro reduces CPU usage by 44% compared with ROS in service discovery.

Section snippets

Robotic middleware

Robotic middleware, such as ROS2, employs computing hosts to organize key components in a distributed environment. In robotic middleware, we usually use Node to denote a function unit or an application module that is launched in the system, and Machine to denote a physical computing unit such as an X86 server or an SoC. Nodes are composed of independent computing processes and contribute to rapid development, resource isolation, function modularity, and code extensibility. A typical

Shared memory based data transmission

Based on the study of socket IPC methods and shared memory methods in autonomous navigation applications, we conclude that a robotic communication middleware with high performance and high reliability is crucial in designing a latency-sensitive and safety-critical system.

Our goal is to design a combined data transmission mechanism in which shared memory provides high-performance data transmission and sockets provide a highly reliable control algorithm for inter-process communication. In this

Socket based communication control

Based on the shared memory pool architecture, we use the socket mechanism to control inter-process communication. We propose a socket based communication control algorithm that improves reliability by avoiding the use of shared variables, such as read-write locks, in shared memory.
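The paper's control algorithm is not reproduced in this excerpt, so the following Python sketch only illustrates the general idea under stated assumptions: the payload stays in shared memory, while a small control message (here an offset and a length) travels over a socket, so no lock has to live inside the shared region. All names (`HEADER`, `publish`, `receive`, `demo`) and the message layout are hypothetical. A closed socket also tells the receiver that its peer has gone away, instead of leaving it blocked on a lock.

```python
import socket
import struct
from multiprocessing import shared_memory

# Hypothetical control message: (offset, length) of a payload in the shared region.
HEADER = struct.Struct("!II")


def recv_exact(sock: socket.socket, size: int):
    """Read exactly `size` bytes, or return None if the peer closed the socket."""
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            return None
        data += chunk
    return data


def publish(region, control: socket.socket, offset: int, payload: bytes) -> None:
    # Write the payload into shared memory, then announce it over the socket.
    region.buf[offset:offset + len(payload)] = payload
    control.sendall(HEADER.pack(offset, len(payload)))


def receive(region, control: socket.socket):
    """Return the announced payload, or None if the publisher has gone away."""
    header = recv_exact(control, HEADER.size)
    if header is None:
        return None
    offset, length = HEADER.unpack(header)
    return bytes(region.buf[offset:offset + length])


def demo():
    region = shared_memory.SharedMemory(create=True, size=1024)
    pub_sock, sub_sock = socket.socketpair()
    try:
        publish(region, pub_sock, 16, b"planning-result")
        first = receive(region, sub_sock)
        pub_sock.close()  # simulate the publisher exiting
        second = receive(region, sub_sock)  # detected as a closed peer
        return first, second
    finally:
        sub_sock.close()
        pub_sock.close()
        region.close()
        region.unlink()
```

Calling `demo()` returns the payload for the first read and `None` for the second, showing that the subscriber learns about the publisher's exit through the socket rather than waiting indefinitely.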

Service discovery and failover strategy

To overcome the reliability problems of the system and improve the efficiency of service discovery, we propose a fault tolerance design and a novel service registration and notification mechanism. In this section, we introduce our design for service discovery and fault tolerance.

Robotic middleware design

Fig. 6 shows the overview of our proposed robotic middleware, called Zoro. In detail, it shows the discovery workflow that facilitates creating communication between publishers and subscribers, and the communication workflow of transferring data.

For discovery, there is one individual discovery process responsible for tracking all module status in the system. Once one module is launched, the middleware of the module communicates with the discovery process and registers itself. Receiving the new

Evaluations

In this Section, we evaluate the performance and reliability of our proposed Zoro.

Related work

A robotic system consists of many modules responsible for different functionality. To cooperate, these modules transfer data using the Remote Procedure Call (RPC) pattern or the Publisher Subscriber pattern. Although MPI has been widely used in the HPC domain, its coupled module management, which requires all modules to be online, cannot satisfy robotic applications. A number of prior works [31], [2], [32], [13] use socket based communication. This method unifies communication on single server and

Conclusions

In this work, we propose a robotic middleware involving communication and service discovery. For communication, we propose a novel shared memory pool architecture for transportation and a control algorithm that uses sockets for control. We also propose a hierarchical memory access protection model to overcome the memory access challenge of shared memory. By combining this design and these techniques, we achieve high-performance and highly reliable communication. Moreover, we propose a

CRediT authorship contribution statement

Wei Liu: Conceptualization, Writing-original draft preparation. Jiangming Jin: Supervision, Conceptualization, Methodology, Writing-reviewing, and editing. Hao Wu: Investigation, Writing-reviewing and editing. Yifan Gong: Investigation, Software. Ziyue Jiang: Software, Validation. Jidong Zhai: Supervision, Writing-reviewing and editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Wei Liu received the B.E. degree from Ocean University of China, P.R.C., in 2013, and the Ph.D. degree from University of Chinese Academy of Sciences in 2018. Currently, he is a software research engineer at TuSimple. His research interests include scientific computing acceleration, robotic systems and high performance computing.

References (38)

  • Jesús Martínez Cruz

    A DDS-based middleware for quality-of-service and high-performance networked robotics

    Concurr. Comput., Pract. Exp.

    (2012)
  • A. Elkady et al.

    Robotics middleware: a comprehensive literature survey and attribute-based bibliography

    J. Robot.

    (2012)
  • Jinyu Gu

    Harmonizing performance and isolation in microkernels with efficient intra-kernel isolation and communication

  • H. Howard et al.

    Raft refloated: do we have consensus?

    ACM SIGOPS Oper. Syst. Rev.

    (2015)
  • Albert S. Huang et al.

    LCM: lightweight communications and marshalling

  • Woochul Kang et al.

    RDDS: a real-time data distribution service for cyber-physical systems

    IEEE Trans. Ind. Inform.

    (2012)
  • Jackie Kay et al.

    Real-Time Control in ROS and ROS 2.0

    (2015)
  • C. Kjellqvist et al.

    Safe, fast sharing of memcached as a protected library

  • L. Lamport

    Paxos made simple

    ACM SIGACT News

    (2001)
    Jiangming Jin graduated from Nanyang Technological University, Singapore, in 2013 with a PhD degree in Computer Engineering. Before that, he obtained his Bachelor's degree from University of Electronic Science and Technology of China in 2008. Dr. Jiangming Jin started his career in the J. P. Morgan Singapore and Beijing offices. At J. P. Morgan, he worked with sophisticated, large-scale financial computing systems and also worked with the JPM Credit Portfolio Group on credit derivative calculations. Dr. Jin began a venture in autonomous driving at TuSimple in 2017. Dr. Jiangming Jin has wide and deep knowledge of high-tech areas including Fintech, AI Chips and Software, Big Data and Machine Learning Systems, 5G and IoT. Dr. Jin has also served as session chair or keynote speaker at several AI summits and forums. Dr. Jin holds several advisory expert positions at universities and non-profit institutes such as Shanghai Tech University, Tianjin University, and Shenzhen Fintech Association.

    Hao Wu received the Ph.D. degree from Friedrich-Alexander-Universität Erlangen-Nürnberg, the Master's degree from University of Science and Technology of China, and the Bachelor's degree from Henan Normal University. He is currently a research engineer at TuSimple. His research interests include programming models on heterogeneous architectures and autonomous driving techniques.

    Yifan Gong received the bachelor's degree from School of Mathematic from Peking University (2006-2010) and obtained his Ph.D. degree from Interdisciplinary Graduate School and School of Computer Engineering of Nanyang Technological University in 2017. He is currently a research engineer in autonomous driving at TuSimple.

    Ziyue Jiang received the B.E. degree in computer science from Shanghai Jiao Tong University in 2019. He is currently a software engineer at TuSimple. His research interests include autonomous driving techniques and high performance computing.

    Jidong Zhai received his Ph.D. degree from Tsinghua University in 2010. He is an associate professor at the Department of Computer Science and Technology, Tsinghua University. His research focuses on high performance computing, especially performance analysis and optimization for large-scale parallel applications and performance evaluation for computer systems. He is currently on the editorial boards of IEEE Transactions on Computers (TC), IEEE Transactions on Parallel and Distributed Systems (TPDS), IEEE Transactions on Cloud Computing (TCC), and Journal of Parallel and Distributed Computing.
