Information Systems

Volume 91, July 2020, 101488
Compatible byte-addressable direct I/O for peripheral memory devices in Linux

https://doi.org/10.1016/j.is.2019.101488

Highlights

  • Buffered I/O requires a disk cache. Direct I/O is not byte-addressable.

  • Memory-mapped direct file I/O for persistent memory is not compatible with existing applications.

  • This paper presents a new I/O layer, byte direct I/O (BDIO), in Linux.

  • BDIO requires no changes to the file I/O interface.

  • BDIO bypasses the page cache even if applications use the buffered I/O interface.

  • BDIO uses a byte-addressable standard file interface.

  • BDIO can support peripheral memory devices that cannot be accessed by the MMU.

  • BDIO was implemented in the Linux kernel.

Abstract

Memory devices can be used as storage systems to provide a lower latency than can be achieved by disk and flash storage. However, traditional buffered input/output (I/O) and direct I/O are not optimized for memory-based storage. Traditional buffered I/O incurs a redundant memory copy through the disk cache, and traditional direct I/O does not support byte addressing. Memory-mapped direct I/O optimizes file operations for byte-addressable persistent memory, which appears to the CPU as main memory; however, its interface is not always compatible with existing applications. In addition, it cannot be used for peripheral memory devices (e.g., networked memory devices and hardware RAM drives) that are not attached to the memory bus. This paper presents a new Linux I/O layer, byte direct I/O (BDIO), that can process byte-addressable direct I/O using the standard application programming interface. It requires no modification of existing application programs and can be used not only for persistent memory but also for peripheral memory devices that are not addressable by a memory management unit. The proposed BDIO layer allows file systems and device drivers to easily support BDIO. The new I/O layer achieved performance improvements of 18% to 102% in evaluation experiments conducted with online transaction processing, file server, and desktop virtualization storage workloads.

Introduction

DRAM serves as the main memory in computers and is also used to cache and buffer data to improve disk performance. Recently, DRAM has been used as a storage system to provide a fast response time that cannot be achieved by disk and flash systems. In-memory computing is used for intensive random accesses in various fields such as large-scale caching systems [1], [2], in-memory databases [3], cloud computing [4], [5], [6], virtual desktop infrastructure [7], [8], and web search engines [9].

A disk cache improves read performance when the hit rate is high. However, even a 1% miss ratio for a DRAM cache can lead to a tenfold reduction in performance, so a caching approach could lead to the faulty assumption that “a few cache misses are okay” [10]. An alternative to disk caching is a ramdisk.

A ramdisk is a software program that sets aside a portion of the main memory, i.e., DRAM chips on DIMM modules, and presents it as a block device that can be formatted with a file system and then mounted. It is sometimes referred to as a software RAM drive to distinguish it from a hardware RAM drive, which is provided by a type of solid-state drive (SSD).

The performance of a ramdisk is, in general, orders of magnitude higher than that of other forms of storage media, such as an SSD (up to 100 times faster) or a hard drive (up to 200 times faster) [11]. It is the fastest type of storage media available, but it cannot retain data without power. To address this problem, many studies have focused on various types of ramdisk systems, ranging from single systems to cluster systems [4], [12], [13].

Thanks to practical developments that have overcome the volatility of RAM, RAM has become a storage medium in more systems. A ramdisk has the same interface as a conventional hard disk, but the traditional block interface is not optimized for RAM as a storage medium. Prefetching, disk scheduling, and the disk cache are designed for hard disk drives, and they degrade the performance of a ramdisk. Prefetching and disk scheduling can be easily and transparently turned off for applications, but the disk cache cannot.

Block devices, such as hard disk drives, transfer data in block units. The disk cache allows applications to process input/output (I/O) in byte units and improves I/O performance. The disk cache, referred to as the page cache in Linux, occupies part of the main memory. However, the page cache is useless for memory devices such as ramdisks and persistent memory (PMEM).

The advent of PMEM and the evolution of large-scale memory have led to many challenges in the field of data storage. Direct Access (DAX), a new I/O interface optimized for PMEM, is now available in Linux. RAM-based file systems were designed to optimize themselves for temporary files.

PMEM technologies such as phase-change RAM, magnetoresistive RAM, and ferroelectric RAM pose challenges for file system designers. PMEM is byte-addressable and directly accessible from the CPU via the memory bus. It offers performance within an order of magnitude of that of DRAM [14].

The byte-addressability of non-volatile memory can make file systems more reliable and simpler. Highly reliable file systems using PMEM have been proposed in several studies [15], [16]. For instance, Dulloor et al. implemented the Persistent Memory File System (PMFS), a lightweight POSIX file system that exploits PMEM’s byte addressability and offers direct access with memory-mapped I/O [14], [16].

Direct access (DAX) is a mechanism defined by the Storage Networking Industry Association (SNIA) as part of the non-volatile memory (NVM) programming model that provides byte-addressable loads and stores when working with a PMEM-aware file system through a memory-mapped file interface [17].

Both the ext4 and XFS file systems are capable of DAX. DAX-enabled file systems support the legacy interface, but direct access is achieved only through the new memory-mapping programming model of DAX. Many existing applications that use the traditional I/O application programming interface (API) cannot utilize the features of DAX.
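To make the memory-mapping model concrete, the following is a minimal user-space sketch (not code from this paper) of DAX-style access through mmap(). The mount point /mnt/pmem and the file name are assumptions, and the MAP_SYNC flag is only available on kernels, C libraries, and DAX-mounted file systems that support it.

    /* Minimal sketch of DAX-style memory-mapped file access (illustrative,
     * not code from this paper). Assumes a DAX-mounted file system. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/pmem/data.bin", O_RDWR);      /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }

        size_t len = 4096;
        /* MAP_SYNC (with MAP_SHARED_VALIDATE) requests a synchronous DAX
         * mapping; kernels or file systems without DAX support reject it. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* Byte-addressable access: plain loads and stores, no read()/write(). */
        memcpy(p, "hello", 5);

        munmap(p, len);
        close(fd);
        return 0;
    }

An application written against read() and write() gains none of this unless it is restructured around the mapping, which is precisely the compatibility gap addressed in this paper.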

Moreover, DAX requires that PMEM be accessible as main memory by the memory management unit (MMU). DAX is therefore not suitable for peripheral memory devices that are not accessible by the MMU. A peripheral memory device can be clustered storage using RAM [4], an SSD-backed RAM disk [18], or a hardware RAM drive.

A RAM-based file system appears as a mounted file system without a block device such as a ramdisk. The temporary file system tmpfs is a typical RAM-based file system that appears in various UNIX-like operating systems. It creates a file system using the shared memory of the kernel, provides a byte interface without the page cache, and transfers data directly between shared memory and user memory. On reboot, everything in tmpfs is lost, so it is used for temporary file storage. It has no recovery scheme after a reboot even if nonvolatile memory is used.
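For reference, a tmpfs instance is created with a single mount call; the sketch below (illustrative only, requiring root privileges) mounts one from C. The mount point /mnt/tmp and the size=1G option are assumptions.

    /* Minimal sketch: mounting a tmpfs instance via mount(2). */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Equivalent to: mount -t tmpfs -o size=1G tmpfs /mnt/tmp */
        if (mount("tmpfs", "/mnt/tmp", "tmpfs", 0, "size=1G") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }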

Tmpfs provides the best performance as a file system for storing temporary files. However, it uses only the main memory, which means it cannot be used with other durable storage such as RAMCloud [4], Apache Hadoop [19], [20], [21], an SSD-backed RAM disk, or persistent memory.

A new interface must support peripheral memory devices, byte accessibility, and compatibility with existing applications. Neither traditional direct I/O nor the recent memory-mapped direct I/O interfaces satisfy all of these needs at once.

Buffered I/O utilizes the page cache, so it aggregates small amounts of data into an integer multiple of the block size. Consider, for example, a process that requests one character from the kernel. The kernel loads the corresponding block into the page cache, and if the process reads the next single character, the request is served immediately from the already loaded block. As a write example, consider a process that sequentially writes one byte of data per write call. The kernel buffers these bytes in a page and flushes the page at a later time.
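The byte-granular pattern described above corresponds to the following minimal sketch (illustrative only; the file name is an assumption):

    /* Byte-granular buffered I/O: each read() asks for one byte, but the kernel
     * fills the page cache with the enclosing block on the first request. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.txt", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        char c;
        /* First call triggers a block-sized fill of the page cache;
         * the second is served from the cache without touching the device. */
        if (read(fd, &c, 1) == 1) printf("first byte: %c\n", c);
        if (read(fd, &c, 1) == 1) printf("second byte: %c\n", c);

        close(fd);
        return 0;
    }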

Fig. 1 shows the kernel structures and data paths for buffered I/O and direct I/O. In buffered I/O, there are two memory copies: from the ramdisk to the page cache and from the page cache to the application buffer. When direct I/O is used for a file, data is transferred directly from the disk to the application buffer, bypassing the page cache.

Direct I/O bypasses the page cache, but it has several constraints, so applications that use buffered I/O for byte-range operations cannot easily be changed to use direct I/O. With direct I/O, application programs must obey the constraints of the block interface. The user memory and the file position used in read() and write() calls must be aligned to the logical block size. That is, the user memory address, the request size, and the request location must be integer multiples of the logical block size, which is typically 4096 bytes. User applications can obtain the logical block size of the underlying block device with the BLKSSZGET ioctl() call. Direct I/O is enabled by calling open() with the O_DIRECT flag.
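A minimal sketch of these constraints (illustrative only; the device and file paths are assumptions) is shown below. The buffer is aligned with posix_memalign(), and the offset and length are kept block-aligned.

    /* O_DIRECT read obeying the block-interface constraints described above. */
    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <linux/fs.h>          /* BLKSSZGET */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int blksz = 4096;                               /* fallback block size */
        int dev = open("/dev/ram0", O_RDONLY);          /* hypothetical ramdisk */
        if (dev >= 0) {
            ioctl(dev, BLKSSZGET, &blksz);              /* logical block size */
            close(dev);
        }

        int fd = open("/mnt/ram/file.dat", O_RDONLY | O_DIRECT);  /* hypothetical */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        /* The user buffer address must be aligned to the logical block size. */
        if (posix_memalign(&buf, blksz, blksz) != 0) { close(fd); return 1; }

        /* Offset and length must also be block-aligned; byte-range requests fail. */
        ssize_t n = pread(fd, buf, blksz, 0);
        printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }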

Memory-mapped direct I/O [14], [22], [23] enables byte-addressable direct access without a system call once a memory mapping has been established, thus providing significantly lower latency after the mapping. DAX uses a new programming interface provided by the Persistent Memory Development Kit (PMDK); other studies use their own user libraries [22], [23].
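As a rough illustration of this programming model, the following sketch uses PMDK’s libpmem (not code from this paper; the path and file size are assumptions):

    /* Memory-mapped direct access via PMDK's libpmem (illustrative). */
    #include <libpmem.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;
        /* Create (if needed) and map a 4 KiB file on a DAX-mounted file system. */
        char *addr = pmem_map_file("/mnt/pmem/log.bin", 4096,
                                   PMEM_FILE_CREATE, 0666,
                                   &mapped_len, &is_pmem);
        if (addr == NULL) { perror("pmem_map_file"); return 1; }

        /* Store through the mapping, then flush it to make the data durable. */
        strcpy(addr, "record");
        if (is_pmem)
            pmem_persist(addr, mapped_len);    /* CPU cache flush, no syscall */
        else
            pmem_msync(addr, mapped_len);      /* falls back to msync() */

        pmem_unmap(addr, mapped_len);
        return 0;
    }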

DAX-enabled file systems support the legacy interface, but the legacy interface does not provide the direct-access feature of DAX. Many existing applications use the standard file API and therefore cannot benefit from the direct access provided by memory mapping. For such applications, we need a new I/O layer that makes the standard file API support byte addressability and direct accessibility without requiring any changes to the existing applications.

A peripheral memory device is memory-based storage that is attached to the peripheral I/O bus and cannot be accessed by memory mapping. Such devices cannot use memory-mapped direct I/O such as DAX; thus, they require another approach.

Networked ramdisks (RAMCloud and Apache Hadoop with a ramdisk), SSD-backed ramdisks, and hardware ramdisks attached to the I/O bus are all types of peripheral memory devices.

RAMCloud utilizes the remote ramdisks of clustered nodes to provide durability to DRAM [4], [13]. It aggregates the main memories of thousands of servers, which keep all information entirely in DRAM. RAMCloud maintains redundant data copies across multiple nodes; thus, it can recover from crashes, providing durable and available storage.

Apache Hadoop with a ramdisk is popular open-source software for reliable and scalable distributed computing. Apache Hadoop uses its own Hadoop Distributed File System (HDFS), a distributed file system that can scale from a single cluster to hundreds of nodes. HDFS [19] stores multiple replicas of a block on different data nodes, thereby providing availability and robustness. It also supports ramdisks on data nodes [20], [21]; the data nodes flush in-memory data to disk asynchronously. The Hadoop cluster system has its own file system on the ramdisk or on persistent storage.

An SSD-backed ramdisk provides strict durability on a local node that uses a ramdisk. It is similar to a mirrored disk array composed of a ramdisk and a flash-based SSD: write requests are delivered to both the ramdisk and the SSD, but read requests are served by the ramdisk only [18]. This storage device is implemented as a typical block device, so it can use neither tmpfs nor DAX.

Flash-based SSDs can potentially utilize the full bandwidth of the I/O bus by maximizing the parallelism of multiple flash chips, which have lower latency than a disk. Nevertheless, flash cannot replace RAM storage for the following reasons:

  • Tenfold higher latency: To transfer a 4 KiB block, flash needs 50 µs whereas DRAM needs 5 µs. Flash also incurs additional latency in its device driver, the host bus adapter, and the controller in the SSD.

  • Limited lifetime: An SSD becomes unreliable beyond a limited number of program/erase (P/E) cycles. A write causes a program cycle and may cause an erase cycle; these cycles wear out the tunnel oxide layer of the transistors. A 2-bit multi-level cell (MLC) flash memory fabricated with a 2x nm process has a maximum lifetime of 3,000 P/E cycles, and a 3-bit MLC flash memory has a lifetime of only a few hundred cycles. Write-intensive workloads may make the lifetime much shorter than the warranted lifetime, so an SSD throttles write performance by adding delays to write requests in order to guarantee the required SSD lifetime [24].

  • Poor performance for small I/O with low concurrency: An SSD can reach the full bus bandwidth by using tens or hundreds of independent flash memory chips, but a large number of concurrent requests is required to utilize all of those chips. A small request issued by a single process utilizes only a single chip, leading to low performance. Fig. 2 compares the 4 KiB random read performance for 1 process and 512 processes on an SSD and a ramdisk (the y-axis is in logarithmic scale). The SSD exhibits a tenfold performance gap between the two process counts, whereas the ramdisk shows only a threefold gap.

The traditional block device suffers from an additional memory copy through the page cache, yet direct I/O cannot process byte-range requests. The conventional RAM-based file system cannot be used with peripheral memory devices such as RAMCloud, HDFS, or SSD-backed ramdisks. Flash cannot be a complete replacement for RAM. DAX requires a new programming interface provided by the Persistent Memory Development Kit (PMDK) [25]. This paper presents a new compatible Linux I/O layer for RAM-based storage, called byte-addressable direct I/O (BDIO). BDIO has the following characteristics:

  • Compatibility: BDIO is transparent to applications; no changes are required for applications to use it. The proposed scheme utilizes the standard file API.

  • Page cache bypass: The application bypasses the page cache even if the buffered I/O interface is used.

  • Byte-range I/O: Unlike direct I/O, which has a block interface, the proposed I/O has a byte interface. Therefore, an application program using byte-range buffered I/O can use BDIO without modification.

  • Peripheral memory devices: BDIO can support peripheral memory devices that cannot be accessed by the MMU.

  • Consistency with buffered write: The proposed scheme provides data consistency even if buffered I/O is mixed with BDIO. This is useful for the SSD-backed ramdisk, which must use buffered writes to the SSD but can allow byte direct read (BDR) from RAM to improve read performance.

BDIO was implemented in the Linux kernel. Block devices and file systems that support BDIO need an additional interface. We implemented a BDIO-capable ramdisk and a BDIO-capable SSD-backed ramdisk, and revised XFS and ext4 to support BDIO.
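From the application’s point of view, nothing changes. The hedged sketch below (illustrative only; the mount point is an assumption) issues an unaligned byte-range read that O_DIRECT would reject; on a BDIO-capable file system, the paper describes the same ordinary call as being served directly from the memory device, bypassing the page cache.

    /* An unmodified application under BDIO: plain buffered-I/O code,
     * no O_DIRECT flag and no alignment requirements. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/bdio/file.dat", O_RDONLY);   /* hypothetical mount */
        if (fd < 0) { perror("open"); return 1; }

        char buf[100];
        /* Unaligned offset (3) and length (100): invalid under O_DIRECT,
         * but valid here because the standard byte interface is unchanged. */
        ssize_t n = pread(fd, buf, sizeof(buf), 3);
        printf("read %zd bytes\n", n);

        close(fd);
        return 0;
    }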

Section snippets

Design and implementation of byte direct I/O

BDIO transfers data directly between a ramdisk and a user application buffer, where the application performs byte-range I/O through the same interface as buffered I/O, without any modification of the application.

Performance evaluation

Linux kernels 3.10 and 4.15 were modified for BDIO. In addition, a ramdisk supporting BDIO, an SSD-backed ramdisk supporting BDR, and XFS and ext4 file systems modified to support BDIO were implemented. The evaluations used the 3.10 kernel and XFS.

Experiments were performed with two 8-core 3.4 GHz Xeon E5-268W CPUs, eight memory channels, and 128 GiB of main memory. The ramdisk capacity was set to 122 GiB.

BDIO and BDR were evaluated using various

Conclusion

This paper presented a new Linux I/O layer, BDIO and BDR, that uses the standard file API for RAM-based peripheral storage. BDIO and BDR bypass the page cache without requiring buffered-I/O applications to be modified to use direct I/O, and they can perform byte-range I/O without a redundant memory copy. In addition, BDR can provide data consistency while buffered writes are used for the SSD-backed ramdisk.

The BDIO ramdisk, SSD-backed ramdisk supporting BDIO, the BDIO-enabled XFS, and the BDIO layer of the Linux kernel

CRediT authorship contribution statement

Sung Hoon Baek: Writing - original draft. Ki-Woong Park: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by a Jungwon University Research Grant (South Korea) (Management Number: 2017-030).

References (29)

  • Luo, Y., et al., A RAMCloud storage system based on HDFS: Architecture, implementation and evaluation, J. Syst. Softw. (2013)

  • Fitzpatrick, B., Distributed caching with memcached, Linux J. (2004)

  • Zhao, D., et al., HyCache: A user-level caching middleware for distributed file systems

  • Lahiri, T., et al., Oracle TimesTen: An in-memory database for enterprise applications, IEEE Data Eng. Bull. (2013)

  • Ousterhout, J., et al., The RAMCloud storage system, ACM Trans. Comput. Syst. (2015)

  • Zaharia, M., et al., Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing

  • Uta, A., et al., Scalable in-memory computing

  • Miller, K., et al., Virtualization: virtually at the desktop

  • Ruest, N., et al., Virtualization, A Beginner’s Guide (2009)

  • Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H.C., McElroy, R., Paleczny, M., Peek, D., Saab, P., et al. ...

  • Ousterhout, J., et al., The case for RAMClouds: scalable high-performance storage entirely in DRAM, Oper. Syst. Rev. (2010)

  • datagram, J., RAMDISK software - What is RAMDisk? (2012)

  • Diehl, S.T., System and method for persistent RAM disk, US Patent 7,594,068, Google Patents, ...

  • Flouris, M.D., et al., The network RamDisk: Using remote memory on heterogeneous NOWs, Cluster Comput. (1999)