Elsevier

Information Sciences

Volume 502, October 2019, Pages 376-393

RDMA-driven MongoDB: An approach of RDMA enhanced NoSQL paradigm for large-Scale data processing

https://doi.org/10.1016/j.ins.2019.06.048

Highlights

  • We propose an effective trade-off scheme among various design choices to achieve high-performance RDMA-driven MongoDB.

  • We put forward an RDMA context detection algorithm for determining whether two RDMA_Mongo nodes communicate over TCP/IP or RDMA.

  • We design a load-aware buffer registration mechanism for reasonable memory region management of RDMA channels.

  • We redesign the oplogs synchronization protocol with RDMA verbs in RDMA_Mongo, including the oplogs initial sync and steady state replication.

  • We have implemented RDMA_Mongo based on MongoDB version 4.1.1-59; it achieves better performance than plain MongoDB.

Abstract

With the rapid development of big data and data center networks, NoSQL databases such as HBase, Cassandra and MongoDB have won great popularity for accelerating many online and offline big data applications. However, under massive and frequent Create/Update/Retrieval/Delete (CURD) operations, the traditional TCP/IP protocol stack struggles to deliver the request rates and response latency that large-scale NoSQL systems require. For example, large-scale data migration or synchronization among multiple clusters in a data center leads to competition for network bandwidth and high delay. To mitigate this transmission bottleneck, we propose RDMA_Mongo, an RDMA-driven document NoSQL paradigm based on MongoDB. The performance of CURD operations is enhanced by one-sided Remote Direct Memory Access (RDMA) primitives (such as RDMA Read/Write) without involving the TCP/IP stack or the CPU. Evaluation on an RDMA-enabled network demonstrates that RDMA_Mongo significantly improves CURD performance compared with plain MongoDB. The results show that, under large-scale data requests, the average insert throughput increases by approximately 30%, the average delete throughput by over 30%, the update throughput by up to 17% and the query throughput by 15%.

Introduction

With the rapid increase of large-scale data that needs to be collected and analyzed online or offline in a data center, distributed NoSQL (Not Only Structured Query Language) database systems such as MongoDB have been widely adopted in both academia and industry. NoSQL databases are designed not only to cope with the scalability and agility challenges that modern applications face, but also to take advantage of the commodity storage and computing power available today. As a typical NoSQL database, MongoDB offers a flexible document data model, auto-sharding and replica sets as core features, and is currently the only NoSQL database to support multi-document transactions. CURD document operations, replica sets and auto-sharding in MongoDB all rely on the underlying network communication. Furthermore, networking for big data [50] is vital to meeting the demands of big data processing. Our evaluation finds that, under highly concurrent big data processing, the transmission load on the traditional network protocol stack has become a bottleneck for MongoDB.

Meanwhile, as an emerging underlying network technology, RDMA (Remote Direct Memory Access), especially RoCE (RDMA over Converged Ethernet), has been widely adopted in industrial datacenters in recent years [10], [14], [16], [25], [38], [41], [43]. It offers high throughput, low latency and CPU bypass for big data applications such as HDFS [13], [36], memory file systems [22], and analysis and modeling on big data [5], [33]. The RDMA network protocol uses zero-copy technology and bypasses the OS kernel to directly access registered memory on a remote host.

Fig. 1 shows the difference between TCP and RDMA. With TCP, the data in the application buffer is encapsulated by the sockets API in user space. The encapsulated data is then written to the network card through the sockets and TCP/IP protocol layers of the OS kernel. With RDMA, by contrast, the data in the application buffer is encapsulated by the IB Verbs API in user space, and the encapsulated data is then written directly to the RDMA network card (RNIC) without any involvement of the operating system kernel.

The mainstream RDMA technologies include InfiniBand, RoCE and iWARP. RDMA primitives can be divided into message-oriented two-sided verbs (such as RDMA SEND/RECV) and memory-sharing one-sided verbs (such as RDMA READ/WRITE) [26], [27]. Message-passing communication requires the receiver to start the transmission session and register a communication buffer; the sender then modifies the data in the remote receiver’s memory buffer to complete the data exchange. The problem is that, in such a message-oriented mode, copying data between the data memory buffer and the communication memory buffer is still unavoidable. The RDMA shared-memory mode merges the data memory buffer and the communication memory buffer by allowing the remote side to remain completely passive.

Existing RDMA-driven data center applications, called the RDMA-Enhanced Paradigm, fall into two categories: RDMA-Enhanced System [4], [13], [14], [15], [16], [17], [18], [19], [20], [22], [28], [29], [36], [38], [39], [43], [44], [46], [47], [49] and RDMA-Enhanced Algorithm [2], [6], [35], [42]. For instance, Nessie [4] proposed a fully client-driven key-value store system using RDMA for high performance, and APUS [42] proposed the first RDMA-based Paxos consensus algorithm. Existing RDMA-driven systems mainly focus on distributed file systems [13], [20], [22], [36], key-value stores [4], [14], [15], [16], [28], [38], [43], [44], [49], parallel databases [19], relational databases [18], [29], and memory transactions [17], [46], [47]. However, to the best of our knowledge, there have been no attempts to enhance document-based NoSQL databases in RDMA-enabled networks. This raises the question: can we accelerate a document-based NoSQL system with RDMA to mitigate the transmission bottleneck?

This paper presents RDMA_Mongo, the first RDMA-driven document-based NoSQL paradigm. We first present a detailed analysis of the MongoDB network transport design. By analyzing the key stages of NoSQL CURD operations under high concurrency, we identify two key challenges: (1) reducing the frequent CPU context switching caused by data transmission over the traditional network stack and (2) reducing waiting time on the MongoDB client side.

To address these challenges, RDMA_Mongo first automatically detects whether RDMA NICs and the related libraries (such as libibverbs and librdmacm) are supported at both the local and remote hosts, and then determines whether to use RDMA verbs or traditional network sockets. Second, based on the current system load (especially memory usage), RDMA_Mongo determines the appropriate buffer size on demand and registers RDMA communication memory regions.
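The detection step described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not RDMA_Mongo's actual algorithm: the function names are hypothetical, the Linux sysfs directory /sys/class/infiniband is assumed as the probe for an RDMA device, and the remote host's result is assumed to arrive as a plain boolean (for example, exchanged over an initial TCP connection).

```python
import ctypes.util
import os


def local_rdma_capable(sysfs_dir="/sys/class/infiniband",
                       find_library=ctypes.util.find_library):
    """Return True if this host appears to have an RDMA device and the
    user-space libraries named in the text (libibverbs, librdmacm).

    Hypothetical helper: probing sysfs is an assumption of this sketch.
    """
    # On Linux, RDMA devices (InfiniBand, RoCE, iWARP) are listed here.
    has_device = os.path.isdir(sysfs_dir) and len(os.listdir(sysfs_dir)) > 0
    # find_library returns None when the shared library cannot be located.
    has_libs = all(find_library(name) is not None
                   for name in ("ibverbs", "rdmacm"))
    return has_device and has_libs


def choose_transport(local_ok, remote_ok):
    """Use RDMA only when both ends support it; otherwise fall back to TCP."""
    return "rdma" if local_ok and remote_ok else "tcp"


if __name__ == "__main__":
    # remote_ok would come from the peer; it is hard-coded here for the demo.
    print(choose_transport(local_rdma_capable(), remote_ok=True))
```

Keeping the decision in a separate function means the fallback rule (RDMA only if both sides agree) can be checked without any RDMA hardware present.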

Third, RDMA_Mongo introduces non-blocking data transmission using the RDMA Completion Queue (CQ). Without modifying the existing MongoDB network transport interface, RDMA_Mongo rewrites the socket implementation using RDMA one-sided shared-memory primitives, which can coexist with the traditional network transport mode. Finally, native MongoDB uses the replica set mechanism to achieve high availability; that is, a MongoDB cluster has one primary node and multiple secondary nodes at the same time. The secondary nodes continuously pull oplogs (operation logs) from the primary node and replay the idempotent oplogs to synchronize the database. To accelerate this process, we optimize and redesign the oplogs synchronization protocol based on RDMA primitives.
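The pull-and-replay behaviour of replica sets described above can be sketched in a few lines of Python. This is a toy in-process model, not MongoDB's or RDMA_Mongo's implementation: the classes `OplogEntry`, `Primary` and `Secondary`, the integer timestamps and the two operation types are all assumptions made for illustration, and the transfer that RDMA_Mongo would carry over RDMA is replaced by a direct method call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OplogEntry:
    ts: int                       # logical timestamp, increasing on the primary
    op: str                       # "set" or "delete" (both idempotent)
    doc_id: str
    value: Optional[dict] = None  # full document for "set", None for "delete"


class Primary:
    def __init__(self):
        self.oplog = []

    def write(self, op, doc_id, value=None):
        # Append an entry with the next timestamp.
        self.oplog.append(OplogEntry(len(self.oplog) + 1, op, doc_id, value))

    def entries_after(self, ts):
        # What a secondary would fetch: everything newer than ts.
        return [e for e in self.oplog if e.ts > ts]


class Secondary:
    def __init__(self):
        self.docs = {}
        self.last_applied = 0

    def apply(self, entry):
        # Each operation sets absolute state, so replaying it is harmless.
        if entry.op == "set":
            self.docs[entry.doc_id] = dict(entry.value)
        elif entry.op == "delete":
            self.docs.pop(entry.doc_id, None)
        self.last_applied = max(self.last_applied, entry.ts)

    def sync_from(self, primary):
        # Pull entries newer than the last applied one and replay them in order.
        batch = primary.entries_after(self.last_applied)
        for entry in batch:
            self.apply(entry)
        return len(batch)
```

Because every entry records the resulting state rather than a delta, replaying the whole log a second time leaves the secondary unchanged, which is the property that makes repeated pulls safe.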

The main contributions of our work can be summarized as follows.

  • We thoroughly analyze the design choices of the RDMA-driven paradigm and demonstrate an effective trade-off among them for a high-performance RDMA_Mongo.

  • We put forward an RDMA context detection algorithm for determining whether two RDMA_Mongo nodes communicate over TCP/IP or RDMA, and also propose a load-aware buffer registration mechanism for reasonable memory region management of RDMA channels.

  • We redesign the oplogs synchronization protocol with RDMA primitives in RDMA_Mongo, including the optimization of oplogs initial sync and steady state replication, in which an RDMA completion event channel is used to receive messages asynchronously.

  • We have implemented RDMA_Mongo based on MongoDB version 4.1.1-59. The detailed evaluation on an RDMA-enabled network demonstrates that, under large-scale data, RDMA_Mongo significantly improves CURD performance: average insert throughput by approximately 30%, update by 17%, query by 15% and delete by over 30%.
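The load-aware buffer registration idea listed among the contributions can be illustrated with a small Python sketch. The paper's actual policy is not given in this excerpt, so everything here is an assumption: the function name `pick_buffer_size`, the linear scaling with the fraction of free memory, the 64 KiB and 4 MiB bounds, and the 4 KiB page alignment are hypothetical choices that only show what "buffer size chosen on demand from memory usage" could look like.

```python
def pick_buffer_size(mem_available, mem_total,
                     min_size=64 * 1024, max_size=4 * 1024 * 1024,
                     page=4096):
    """Pick a buffer size (in bytes) that grows with the share of free memory.

    Hypothetical policy: linear interpolation between min_size and max_size,
    rounded down to a multiple of the page size.
    """
    if mem_total <= 0:
        raise ValueError("mem_total must be positive")
    # Clamp the free-memory ratio to the range [0, 1].
    free_ratio = max(0.0, min(1.0, mem_available / mem_total))
    size = min_size + (max_size - min_size) * free_ratio
    # Align down to a page boundary, but never go below the minimum.
    return max(min_size, int(size) // page * page)


if __name__ == "__main__":
    # Example: a host with half of its 16 GiB of memory free.
    print(pick_buffer_size(8 * 1024**3, 16 * 1024**3))
```

On Linux, a caller might take the two memory figures from the MemAvailable and MemTotal fields of /proc/meminfo; the sketch leaves that out so the policy itself stays easy to test.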

Section 2 describes the typical NoSQL categories, RDMA and recent efforts on the RDMA-Enhanced Paradigm, and then introduces the problem statement. Section 3 gives an overview of the RDMA_Mongo architecture and discusses the tradeoff among different design choices. Section 4 proposes the RDMA context detection algorithm and the load-aware buffer registration mechanism, and details the RDMA-enabled oplogs synchronization. In Section 5, we evaluate our system and compare it against plain MongoDB. In Section 6, we survey recent work on modern network hardware, kernel-bypass networking, RDMA protocol stack optimization, and RDMA-driven NoSQL. Section 7 concludes the paper and outlines our expectations for future work.

Section snippets

NoSQL

NoSQL (Not Only SQL) databases generally refer to non-relational databases [32]. As data scale grows, traditional relational databases are no longer able to cope with such large amounts of data. NoSQL databases have therefore become increasingly popular among developers [24]. The concept was first used in 1998 and was picked up again in 2009 [11]. There are four mainstream data models of NoSQL databases: Key-value, Column-oriented, Document and Graph [11], [24]. Table 1 shows the NoSQL

RDMA_Mongo design

RDMA_Mongo is an RDMA-aware design for document-based MongoDB that takes advantage of the RDMA design space for the MongoDB transport layer. In this section, we first give an overview of RDMA_Mongo, including the system architecture and the key RDMA-involved communication between different components. Afterwards, we describe the overall design space for RDMA-enhanced system optimization. Finally, we discuss the tradeoff among different design choices with the aim of higher throughput and

Implementation

This section presents a detailed introduction to the RDMA context detection, buffer registration strategy, verbs selection and oplogs synchronization employed by RDMA_Mongo. The RDMA context detection and load-aware buffer registration algorithms are executed during the initialization and connection establishment phases, respectively. The RDMA_Mongo oplogs synchronization phase comprises an RDMA-enabled oplogs fetching algorithm on the secondary node and an oplogs observing algorithm on the primary node. The

Evaluation

In this section, we evaluate RDMA_Mongo’s throughput, latency and time consumed over RoCE NICs along two dimensions: single-threaded performance (Section 5.2) and multi-threaded performance (Section 5.3). In each dimension, we evaluate four operations: insert, delete, update and query. In Section 5.1, we describe our experimental setup. In Section 5.2, we present a comparative performance analysis of RDMA_Mongo using a single thread. In Section 5.3, we evaluate the time consumed

Related work

Modern Network Hardware: There are three types of RDMA networks: InfiniBand [34], RoCE, and iWARP [37]. InfiniBand is a network designed for RDMA, which guarantees reliable transmission at the hardware level. RoCE and iWARP are both Ethernet-based RDMA technologies, and they support the corresponding verbs interface [3]. Specifically, the RoCE protocol has two versions, RoCEv1 and RoCEv2 [48]. The main difference between the two versions is that RoCEv1 is based on the RDMA protocol implemented by

Conclusion

Inspired by the RDMA-enhanced paradigm, we propose the first RDMA-enabled document-based NoSQL paradigm, named RDMA_Mongo. It efficiently exploits OS kernel bypass and zero copy in a PFC-enabled RDMA network to mitigate the challenges of traditional network transmission: frequent CPU involvement and long waiting time for MongoDB clients under high concurrency. In this paper, we analyzed the MongoDB network transport layer in particular. Further, we demonstrated a rich design space and

Disclosure of conflicts of interest

None.

Acknowledgements

We sincerely thank the anonymous reviewers for their insightful comments. We would like to thank Pengzhi Zhu and Qingchun Song, members of HPC-AI International Advisory Committee, for their technical support. We would like to thank Gaofeng Feng and Gil Blooh, researchers in Mellanox, for their helpful advice. The work of this paper is supported by National Natural Science Foundation of China under Grant (No. 61873309, No. 61572137, and No. 61728202), and Shanghai 2018 Innovation Action Plan

References (50)

  • C. Guo et al.

    RDMA over commodity ethernet at scale

    Proceedings of the 2016 ACM SIGCOMM Conference

    (2016)
  • J. Han et al.

    Survey on NoSQL database

    2011 6th international conference on pervasive computing and applications

    (2011)
  • J. Huang et al.

    High-performance design of HBase with RDMA over Infiniband

    2012 IEEE 26th International Parallel and Distributed Processing Symposium

    (2012)
  • N.S. Islam et al.

    High performance design for HDFS with byte-addressability of NVM and RDMA

    Proceedings of the 2016 International Conference on Supercomputing

    (2016)
  • N.S. Islam et al.

    Accelerating I/O performance of big data analytics on HPC clusters through RDMA-based key-value store

    2015 44th International Conference on Parallel Processing

    (2015)
  • J. Jose et al.

    Memcached design on high performance RDMA capable interconnects

    2011 International Conference on Parallel Processing

    (2011)
  • A. Kalia et al.

    Using RDMA efficiently for key-value services

    ACM SIGCOMM Comput. Commun. Rev.

    (2015)
  • A. Kalia et al.

    FaSST: fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs

    12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)

    (2016)
  • F. Li et al.

    Accelerating relational databases by leveraging remote memory and RDMA

    Proceedings of the 2016 International Conference on Management of Data

    (2016)
  • F. Liu et al.

    Design and evaluation of an RDMA-aware data shuffling operator for parallel database systems

    Proceedings of the Twelfth European Conference on Computer Systems

    (2017)
  • X. Lu et al.

    Accelerating spark with RDMA for big data processing: early experiences

    2014 IEEE 22nd Annual Symposium on High-Performance Interconnects

    (2014)
  • Y. Lu et al.

    Multi-path transport for RDMA in datacenters

    15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18)

    (2018)
  • Y. Lu et al.

    Octopus: an RDMA-enabled distributed persistent memory file system

    2017 USENIX Annual Technical Conference (USENIX ATC 17)

    (2017)
  • P. MacArthur

    Userspace RDMA verbs on commodity hardware using DPDK

    2017 IEEE 25th Annual Symposium on High-Performance Interconnects (HOTI)

    (2017)
  • I. MongoDB, Top 5 considerations when evaluating nosql databases, White Paper, 2015,...