PB-NVM: A high performance partitioned buffer on NVDIMM

https://doi.org/10.1016/j.sysarc.2019.03.007

Abstract

Eviction processes in disk-based DBMSs are usually bottlenecks in write-intensive, highly concurrent OLTP workloads for two reasons: block devices have high IO latencies, and database systems generate redundant writes to guarantee atomic page updates. One solution is to store dirty pages from DRAM on a flash SSD and then asynchronously flush those pages to storage. However, this approach suffers from high lock contention because it relies on one centralized buffer shared by all of the eviction threads.

This paper presents a high-performance partitioned buffer on NVDIMM (PB-NVM) as an extended cache between DRAM and flash SSDs. PB-NVM achieves a high-performance eviction process and guarantees the atomicity of page updates without redundant writes or the high contention of a centralized buffer. PB-NVM writes dirty pages from DRAM to disjoint buckets on NVDIMM and asynchronously flushes those dirty pages as a batch from the buckets to flash SSDs without blocking upcoming evictions. We implement the proposed scheme in InnoDB/MySQL and experiment with the TPC-C and Linkbench benchmarks on a real NVDIMM server. Our empirical results show that, compared to vanilla InnoDB, our proposed method improves the throughput by up to approximately 2.47× while reducing the flushing time per transaction by up to 7×.

Introduction

Traditional disk-oriented database management systems (DBMSs) assume a two-layer storage hierarchy: fast, volatile DRAM for storing short-term working sets (i.e., the buffer pool) and slower but larger-capacity non-volatile block devices (i.e., SSDs/HDDs) for storing long-term data. Unfortunately, of the major components in on-line transaction processing (OLTP) DBMSs, the buffer pool is known to have the highest overhead [1]. When the buffer pool is full, a replacement policy, such as LRU, reclaims free space by evicting dirty pages from the buffer pool to non-volatile storage. This eviction process is non-trivial and is widely considered a bottleneck for two reasons: (1) block devices have slow write IOs and (2) DBMSs must ensure atomicity and durability for each individual evicted page [2], [3].

Replacing magnetic disks with flash SSDs solves the slow write IO problem, but certain issues still remain. Due to the asymmetric read/write performance of flash SSDs, writing a page on an SSD is slower than reading one. Consequently, the buffer pool quickly fills up, and the system spends more time reclaiming free slots in the buffer pool for both read and write requests [4]. Moreover, flash-based DBMSs must account for the overhead of garbage collection (GC) and the wear-leveling mechanism of flash SSDs [5].

The atomic write problem is complicated as well. In order to ensure atomic writes and durability for each evicted page, DBMSs adopt redundant writes, which impose high overhead on the system [2]. For example, InnoDB accumulates evicted pages in a double-write buffer (DWB) in DRAM and then writes each page twice when the DWB is full: the first write goes synchronously to the disk-resident DWB area, followed by an fsync, and the second write goes asynchronously to the page's original location. This redundant-write problem becomes critical in flash-based DBMSs because of GC overhead and the shortened flash lifespan [2], [4], [6]. Another major problem with the DWB is lock contention. Because the DWB is a single buffer shared among eviction threads, a thread must acquire the buffer's lock before writing evicted pages into it. In addition, when the DWB is full, the DBMS stalls all eviction threads until the preceding DWB has persistently written its last page to disk.
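The double-write protocol described above can be sketched as follows. This is a minimal illustration, not InnoDB's actual implementation; the function name, file descriptors, and slot layout are our own assumptions:

```python
import os

PAGE_SIZE = 16 * 1024   # InnoDB's default page size
DWB_SLOTS = 128         # assumed capacity of the doublewrite area, in pages

def flush_with_doublewrite(data_fd, dwb_fd, dirty_pages):
    """Write each evicted page twice: first into the disk-resident DWB
    area (made durable with fsync), then to its home location.  A torn
    write of either copy can be repaired from the other on recovery."""
    assert len(dirty_pages) <= DWB_SLOTS
    # Step 1: sequential writes into the DWB area, forced to stable
    # storage before the data file is touched.
    for slot, (page_no, page) in enumerate(dirty_pages):
        os.pwrite(dwb_fd, page, slot * PAGE_SIZE)
    os.fsync(dwb_fd)
    # Step 2: the in-place writes; a crash here still leaves an intact
    # copy of every page in the DWB.
    for page_no, page in dirty_pages:
        os.pwrite(data_fd, page, page_no * PAGE_SIZE)
    os.fsync(data_fd)
```

The sketch makes the cost that motivates this paper explicit: every page crosses the IO path twice, and each batch incurs at least two fsync calls.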

Solving those two problems of the eviction process is one of the most significant challenges faced by researchers in this field. Kang et al. [4] adopt a flash-based cache between DRAM and block devices in order to convert the small random writes of evicted pages from the DRAM buffer into large sequential writes onto the flash cache. When the flash cache is full, the propagation process asynchronously writes pages back to the storage devices. The flash cache improves transaction throughput and reduces recovery time. However, this approach loses its utility as flash SSDs continue to replace hard disks as the primary storage devices in enterprise systems [5]. Thus, modern database systems require a new class of non-volatile device that is significantly faster than flash SSDs and guarantees atomic writes.

Fortunately, Non-Volatile Dual In-line Memory Modules (NVDIMMs) are good candidates for next-generation non-volatile caches that can solve the two problems of eviction processes. First, NVDIMMs combine DRAM memory with mature NAND flash technology behind the standard DIMM interface, so they are fast like DRAM and non-volatile like SSDs. Second, they are commercially available and well supported by both hardware and software [7]. Finally, they have near-infinite endurance, so replacing the flash cache with an NVDIMM cache eliminates the overhead of GC and wear-leveling in flash SSDs. However, NVDIMMs differ from both DRAM and SSDs/HDDs in a number of aspects; thus, DBMS components must be revisited to fit these new primitives [8], [9]. Moreover, if an NVDIMM cache approach uses only a single buffer shared among the eviction threads, that buffer may suffer high lock contention that degrades caching performance.

Those observations and analyses motivate us to design a new NVDIMM cache architecture that improves the performance of the eviction process while guaranteeing atomic writes and eliminating the high contention of a single cache. In this scheme, we solve the slow IO problem and eliminate the flash-cache overhead by adopting an NVDIMM cache layer between DRAM and the storage devices (i.e., SSDs). In order to reduce lock contention, we split the single buffer into smaller buffers of equal size using partitioning algorithms and separate the eviction process (i.e., the process that writes dirty pages from DRAM to the NVDIMM cache) from the propagation process (i.e., the process that writes pages from the NVDIMM cache to SSDs when the cache is full). Our architecture guarantees atomic writes both from DRAM to the NVDIMM cache and from the NVDIMM cache to the SSD, using an open-source library [10].
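A minimal sketch of the partitioned design just described, with an assumed bucket count, bucket capacity, and partition function (hashing the page identifier); the actual PB-NVM places the buckets on NVDIMM and persists them atomically, which this in-memory sketch omits:

```python
import threading
from collections import OrderedDict

N_BUCKETS = 8        # number of disjoint buckets (assumed)
BUCKET_CAPACITY = 4  # pages per bucket before propagation (assumed)

class PartitionedBuffer:
    """Eviction threads write dirty pages into disjoint buckets, each
    guarded by its own lock, so threads mapped to different buckets
    never contend.  A full bucket is swapped out and flushed to the
    SSD as one batch while new evictions continue on its empty
    replacement."""
    def __init__(self, flush_batch):
        self.flush_batch = flush_batch  # callback that writes one batch to the SSD
        self.buckets = [OrderedDict() for _ in range(N_BUCKETS)]
        self.locks = [threading.Lock() for _ in range(N_BUCKETS)]

    def evict(self, fileid, pageid, page):
        b = hash((fileid, pageid)) % N_BUCKETS  # partition function (assumed)
        with self.locks[b]:
            self.buckets[b][(fileid, pageid)] = page
            if len(self.buckets[b]) < BUCKET_CAPACITY:
                return
            # Swap the full bucket for an empty one under the lock ...
            batch, self.buckets[b] = self.buckets[b], OrderedDict()
        # ... and propagate outside the lock, so the next eviction into
        # this bucket is not blocked by the SSD write.
        threading.Thread(target=self.flush_batch, args=(batch,)).start()
```

The key design choice mirrored here is that the bucket lock covers only the in-memory insert and swap, never the slow propagation IO.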

Our approach differs from those of previous works in several aspects. Unlike previous studies, which required specialized hardware and a modified flash translation layer (FTL) [2], [11], [12], we implement the proposed scheme entirely in the buffer management module of the DBMS. In addition, our scheme uses NVM as an extended cache between DRAM and the SSD rather than for transaction logging [13], [14], [15], [16], [17] or as primary storage [18]. As the cache medium, we use NVDIMMs instead of flash SSDs [4], [6], [19] in order to eliminate the overheads of GC and wear-leveling and to extend the flash lifespan. Our approach is similar to the existing NVDIMM cache approach [3], but we adopt partitioned buffers instead of one centralized buffer and issue asynchronous IO in batches of pages instead of page by page. The main contributions of this paper can be summarized as follows:

  • We propose a partitioned buffer (referred to as PB-NVM) that exploits NVDIMMs as a non-volatile cache to achieve high transaction throughput while guaranteeing atomic writes, durability, and recoverability of the database system, without the costs of redundant writes or high lock contention.

  • We further improve performance by aggregating the individual asynchronous flushes from the NVDIMM to the SSDs into large batches, leveraging the high internal parallelism of flash SSDs with a multi-threaded architecture.

  • We implement the proposed scheme in InnoDB/MySQL and evaluate its performance with TPC-C and Linkbench in a real NVDIMM system. Our experimental results show that, compared to the original InnoDB/MySQL, our proposed method improves the throughput by up to approximately 2.47× while significantly reducing the flushing time and other overheads.

  • We also propose three partitioning algorithms and provide a detailed analysis of the trade-offs between algorithms.

The rest of this paper is organized as follows: Section 2 explains the atomicity of page writes in disk-based DBMSs, the background of NVDIMM technology, and our motivation. We describe the architecture, supported operations, and partitioning algorithms of our proposed scheme in Sections 3, 4, and 5, respectively. Section 6 presents the experimental results, analysis, and discussion. Related work and the conclusion are given in Sections 7 and 8, respectively.

Section snippets

Background

In this section, we describe how disk-based DBMSs guarantee atomic writes. We then present an overview of NVDIMM technology, its potential to solve the atomic write problem, and our motivation in this work.

Architecture

In this section, we provide an overview of the NVDIMM cache architecture that overcomes all three challenges mentioned in Section 2.3. We then discuss four key components of the architecture in detail.

PB-NVM operations

We now present the key operations supported by PB-NVM. First, it guarantees atomic writes from the buffer pool to the NVDIMM cache and from the NVDIMM cache to flash SSDs. Second, it serves as a read cache to speed up performance. Lastly, it supports recovery from a system crash. In particular, we detail how these operations are implemented under the architecture explained above.

Partitioning algorithms

During an eviction process, the partitioning algorithms determine which bucket in the Partition area the eviction thread writes to. These algorithms affect the eviction performance, the remaining lifetime of pages in the NVDIMM buffer, and the overhead of the propagation process. We propose three partitioning algorithms based on the file identifier (fileid) and the page offset (pageid), named EVEN, SINGLE, and LESS. Fig. 3 illustrates how the algorithms differ from one another with the assumption
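The snippet is truncated before the algorithms are defined, but the names suggest how a partition function might map a (fileid, pageid) pair to a bucket. The following sketch is only our assumed reading of SINGLE (one fixed bucket, i.e., the centralized baseline) and EVEN (hash pages evenly over all buckets); LESS is omitted because the snippet gives no hint of its rule:

```python
N_BUCKETS = 8  # buckets in the Partition area (assumed)

def single(fileid, pageid):
    """SINGLE: every eviction targets one fixed bucket, which
    degenerates to the centralized-buffer baseline (assumed reading)."""
    return 0

def even(fileid, pageid):
    """EVEN: spread pages evenly across all buckets by hashing the
    page identifier (assumed reading)."""
    return hash((fileid, pageid)) % N_BUCKETS
```

Under SINGLE all eviction threads contend on one bucket lock, whereas EVEN lets threads hashing to different buckets proceed independently, which is the trade-off the paper's analysis of the algorithms examines.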

Experimental settings

We conduct experiments with the TPC-C [35] and Linkbench [36] benchmarks as the top client layer. We also use a write-intensive Linkbench workload, modified from the original, to test the system under a write-heavy environment. To obtain realistic results, we implement PB-NVM by modifying a recent version of InnoDB/MySQL (i.e., 5.7.20) and run the experiments on a real NVDIMM server with a 48-core Intel Xeon E5-2650v4, 128 GB DRAM, 32 GB NVDIMM-N, and a 512 GB Samsung 850 Pro SSD. We set the

Related work

Combining firmware and hardware can guarantee atomic page updates. For example, Fusion-io atomic flash drives [37] guarantee atomic writes at the FTL level and provide software APIs that allow DBMSs to skip redundant writes. This approach is available in commercial alternatives to MySQL such as MariaDB [38] and Percona [39]. Similarly, Kang et al. [2] address the atomic write problem by introducing DuraSSD, a capacitor-backed SSD. Ouyang et al. [12] modify the flash translation layer

Conclusion

In this paper, we have exploited NVDIMM as a partitioned buffer between DRAM and block devices to boost the performance of the eviction process while guaranteeing atomicity, durability, and recoverability. A partitioned buffer combined with batched asynchronous IOs outperforms a single buffer when both types of buffer are located on NVDIMM. Compared to the original InnoDB, our proposed method improves the throughput and reduces the flushing time, the amount of written data, and the number of fsync calls

Acknowledgments

This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the “SW Starlab” (IITP-2015-0-00314) supervised by the IITP (Institute for Information & communications Technology Promotion), and supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018R1A2B2005502).

Trong-Dat Nguyen received the M.S. degree from the School of Computer Science and Engineering, Kyungpook National University, Korea, in 2014. He is currently working toward the Ph.D. degree at Sungkyunkwan University, Suwon, Korea. His research interests include NoSQL DBMSs, flash-based database technology.

References (39)

  • S. Harizopoulos et al., OLTP through the looking glass, and what we found there, Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (2008)
  • W.-H. Kang et al., Durable write cache in flash memory SSD for relational and NoSQL databases, Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (2014)
  • Y. Son et al., A log-structured buffer for database systems using non-volatile memory, Proceedings of the Symposium on Applied Computing (2017)
  • W.-H. Kang et al., Flash-based extended cache for higher throughput and faster recovery, Proc. VLDB Endow. (2012)
  • S.-W. Lee et al., A case for flash memory SSD in enterprise database applications, Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (2008)
  • J. Do et al., Turbocharging DBMS buffer pool using SSDs, Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (2011)
  • R. Chen et al., Bridging the I/O performance gap for big data workloads: a new NVDIMM-based approach, 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (2016)
  • A. van Renen et al., Managing non-volatile memory in database systems, Proceedings of the 2018 International Conference on Management of Data (2018)
  • A. Eisenman et al., Reducing DRAM footprint with NVM in Facebook, Proceedings of the Thirteenth EuroSys Conference (2018)
  • Intel, pmem.io persistent memory programming, 2018,...
  • D.-H. Bae et al., 2B-SSD: the case for dual, byte- and block-addressable solid-state drives, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA) (2018)
  • X. Ouyang et al., Beyond block I/O: rethinking traditional storage primitives, 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA) (2011)
  • R. Fang et al., High performance database logging using storage class memory, 2011 IEEE 27th International Conference on Data Engineering (ICDE) (2011)
  • S. Gao et al., PCMLogging: reducing transaction logging overhead with PCM, Proceedings of the 20th ACM International Conference on Information and Knowledge Management (2011)
  • J. Huang et al., NVRAM-aware logging in transaction systems, Proc. VLDB Endow. (2014)
  • G. Oh et al., SQLite optimization with phase change memory for mobile applications, Proc. VLDB Endow. (2015)
  • T. Wang et al., Scalable logging through emerging non-volatile memory, Proc. VLDB Endow. (2014)
  • J. Arulraj et al., Let's talk about storage & recovery methods for non-volatile memory database systems, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015)
  • M. Canim et al., SSD bufferpool extensions for database systems, Proc. VLDB Endow. (2010)

Sang-Won Lee received the Ph.D. and M.S. degrees from the Computer Science Department, Seoul National University, Korea, in 1999 and 1994, respectively. He is a professor with the College of Information & Communication Engineering, Sungkyunkwan University, Suwon, Korea. He was a research professor at Ewha Womans University and a technical staff member at Oracle, Korea. His research interests include flash-based database technology.
