
IS-HBase: An In-Storage Computing Optimized HBase with I/O Offloading and Self-Adaptive Caching in Compute-Storage Disaggregated Infrastructure

Published: 12 April 2022


Abstract

Active storage devices and in-storage computing have been proposed and developed in recent years to reduce the amount of required data traffic and to improve overall application performance. They are especially attractive in a compute-storage disaggregated infrastructure. In both techniques, a simple computing module is added to storage devices/servers so that some stored data can be processed in the storage devices/servers before being transmitted to application servers. This reduces the required network bandwidth and offloads certain computing work from application servers to storage devices/servers. However, several challenges arise when designing an in-storage computing-based architecture for applications: which computing functions should be offloaded, how to design the protocol between in-storage modules and application servers, and how to handle caching in application servers.

HBase is an important and widely used distributed key-value (KV) store. It stores and indexes KV-pairs in large files in a storage system like HDFS. However, its performance, especially read performance, suffers from the heavy traffic between HBase RegionServers and storage servers in a compute-storage disaggregated infrastructure when the available network bandwidth is limited. We propose an in-storage-computing-based HBase architecture, called IS-HBase, to improve the overall performance and to address the aforementioned challenges. First, IS-HBase executes a data pre-processing module (In-Storage ScanNer, called ISSN) for some read queries and returns the requested KV-pairs to RegionServers instead of returning HFile data blocks. IS-HBase also carries out compactions in storage servers, which avoids transmitting large amounts of data through the network and thus effectively reduces compaction execution time. Second, a set of new protocols is proposed to handle the communication and coordination between HBase RegionServers at compute nodes and ISSNs at storage nodes. Third, a new self-adaptive caching scheme is proposed to better serve read queries with fewer I/O operations and less network traffic. According to our experiments, IS-HBase can reduce network traffic by up to 97% for read queries, and its throughput (queries per second) is significantly less affected by fluctuations in available network bandwidth. The execution time of compaction in IS-HBase is only about 6.31%–41.84% of that of legacy HBase. In general, IS-HBase demonstrates the potential of adopting in-storage computing for other data-intensive distributed applications to significantly improve performance in a compute-storage disaggregated infrastructure.


1 INTRODUCTION

Currently, in the most popular IT infrastructures, the compute cluster is disaggregated from the storage cluster (called compute-storage disaggregated infrastructure) to achieve higher flexibility, availability, manageability, and scalability [7, 10, 22, 39]. A compute node in the compute cluster has a relatively powerful CPU and a large memory space, which supports applications such as various stateless services, caching, and compute services. The storage cluster consists of one or multiple large storage pools with different types of storage devices, and it provides several storage interfaces including block, file system, key-value (KV), and object for the services running in the compute cluster. Compute servers and storage servers are connected via networks. On one hand, cloud service providers [17] (e.g., Amazon S3 [3], Microsoft Windows Azure [12], and Huawei Cloud [31]) and many large Internet companies (e.g., Facebook and Google) have deployed and implemented their IT infrastructures with this type of compute-storage disaggregation. On the other hand, as cloud computing is widely used [3, 12], compute and storage services are naturally disaggregated.

However, the compute-storage disaggregated infrastructure also brings new performance issues. All desired data must be transferred through the network connections between the compute cluster and the storage cluster. On one hand, based on the benchmarking and evaluation in [22], the average I/O throughput of accessing remote storage through the network can be as low as 30–50 MBps. On the other hand, skew occurs naturally in many workloads and causes CPU and I/O imbalance, which makes the throughput even lower [10].

When the workloads are light and stable, the traffic through the network causes few performance issues. However, when spiked I/O requests are generated by applications with intensive reads/writes, the available connection bandwidth can easily become over-saturated. The bandwidth available to other applications is squeezed, and the performance of applications requiring high storage I/O is impacted. In this article, we choose HBase [4], a distributed KV store, to investigate the potential performance issues in the compute-storage disaggregated infrastructure and to demonstrate the effectiveness of our proposed approaches.

HBase [4, 19, 21, 38], derived from BigTable [15], is an important and widely used distributed column-oriented KV store, and its performance can be severely impacted when it is deployed in a compute-storage disaggregated infrastructure. In this article, we explore the potential of applying in-storage computing to improve the performance of HBase by reducing the traffic between compute servers and storage servers. RegionServers of HBase are responsible for receiving and responding to client queries. Typically, a RegionServer caches the KV-pairs from write queries (e.g., Put and Delete) in a MemStore and appends them to a write-ahead log (WAL) to ensure data reliability. When a MemStore is full, its KV-pairs are sorted and written to storage as a single file, called an HFile, which consists of tens of data blocks. Read queries (e.g., Get and Scan) usually trigger many HFile block reads and cause serious read amplification (i.e., the data being sent back to the client is much smaller than the data being read out from storage).

In the legacy deployment of HBase, the HFiles of a RegionServer are usually stored on the same host via HDFS interfaces [5, 11, 35] (i.e., the host acts as both a RegionServer and an HDFS DataNode). However, in the compute-storage disaggregated infrastructure, RegionServers run on compute nodes of the compute cluster, HFiles are stored on storage nodes of the storage cluster, and all I/Os have to go through the network connecting the compute and storage clusters. In this infrastructure, the storage cluster provides various interfaces for different applications, and HFiles can be stored as files or as objects. Therefore, in this article, we make no special assumptions about the storage system that holds HFiles in the storage cluster. The in-storage modules offloaded to the storage nodes use the same I/O interfaces as HBase to interact with the storage system, and these modules do not need to know internal mechanisms such as data replication, synchronization, and metadata management in the storage system. The offloaded modules only consume the compute resources and network bandwidth of storage nodes.

To investigate how seriously the performance (especially read performance) of HBase is influenced by the network condition in a compute-storage disaggregated infrastructure, we deployed HBase in a compute cluster and evaluated its performance with different available network bandwidths. The major observations are: (1) For the KV workloads generated by YCSB with either a Zipfian or a Uniform distribution [16], Get queries can cause extremely high network traffic (e.g., the data being read out is about 40–67 times larger than the data being sent back to a client); (2) Scan queries with a shorter scan length usually have a higher read amplification; (3) For Scan queries with filters, the more data is filtered out, the larger the read amplification; and (4) When the available network bandwidth is low, the throughput of Get and Scan queries becomes very low due to the long network transmission time of HFile data blocks.

To address the performance issues caused by insufficient network bandwidth, we adopt the in-storage computing approach to redesign the read paths and compaction logic of HBase (the resulting system is called IS-HBase). Since a storage server, which may connect to multiple storage devices, has a relatively powerful CPU and a larger memory space than those in storage devices, more complicated functions/services can be executed in storage servers to pre-process the stored data instead of processing it only in application servers or storage devices.

The concept of in-storage computing has been widely used in active disk drives [32, 33, 34], smart SSDs [9, 18, 24, 25], Kinetic drives [1, 13, 27], data processing-based SSDs [26, 29, 40, 41], and the KVSSD proposed by Samsung [2]. In-storage computing is also extended to other scenarios to improve application performance, especially when the data actually requested by an application is much smaller than the data read out and delivered from storage. YourSQL [23] offloads several I/O-intensive functions of MySQL, such as range queries, to SSDs so that when Select or other range queries are issued, the overall queries per second (QPS) can be effectively improved. With the help of in-storage computing, not only are the required I/Os effectively reduced, but better parallelism across multiple storage devices can also be achieved. If the computing process of an application is not bottlenecked by the required I/Os, it can continue executing such that the overall throughput and performance are greatly improved.

IS-HBase offloads part of the read logic and the compaction routine from HBase RegionServers to storage nodes as a service module called the In-Storage ScanNer (ISSN). For Get and some Scan queries (depending on the Scan query requirements and network conditions), data blocks are pre-processed by ISSNs in the storage nodes, and only the useful results (the requested KV-pairs) are sent through the network and combined at the RegionServer. Since only the requested KV-pairs are transferred through the network, network bandwidth consumption is effectively reduced. Therefore, high performance can still be achieved when the available network bandwidth is limited. Moreover, if multiple HFiles related to one query are scattered over different storage nodes, IS-HBase can search the KV-pairs in parallel, which further improves performance. For a compaction, one ISSN is selected to execute the compaction of a set of HFiles, which avoids the round-trip transmission of KV-pairs to RegionServers and back during the compaction. This effectively relieves the potential performance degradation caused by the high I/O and CPU pressure of compaction.

In-storage computing usually introduces a challenging caching issue for applications. In IS-HBase, since only the requested KV-pairs are sent to RegionServers for read queries instead of all relevant data blocks, the original block-based cache design at a RegionServer can no longer be used. To address this caching issue, we propose a self-adaptive caching scheme for IS-HBase. First, the cache space is dynamically allocated between complete data blocks and partial blocks of HFiles. A partial block caches the KV-pairs from the same data block that were transmitted by ISSNs in response to read queries. Second, since KV workloads may have strong key-space localities [14], IS-HBase upgrades some of the partial blocks into complete blocks to reduce cache misses. Third, the cache space for complete blocks and partial blocks is dynamically adjusted to better adapt to workload and network bandwidth variations.

According to our experiments, IS-HBase can achieve up to 95% and 97% network traffic reduction for Get queries with Zipfian and Uniform distributions, respectively, and its QPS is significantly less affected by the variation of available network bandwidth. There is also up to 94% data transmission reduction for Scan queries with column filters. Moreover, IS-HBase with the self-adaptive block cache further improves the average QPS by 26.31% compared with IS-HBase with a KV-pair cache only, and the execution time of in-storage compaction is only about 6.31%–41.84% of the compaction execution time of legacy HBase.

This article is organized as follows. We briefly introduce the system background of HBase and the compute-storage disaggregated infrastructure in Section 2. Related work on active storage devices and near-data processing is discussed in Section 3. We evaluate and discuss the performance of HBase in a compute-storage disaggregated infrastructure in Section 4; the results motivate us to offload some I/O-intensive modules from HBase to storage nodes. In Section 5, we introduce IS-HBase, which offloads the HFile scanner module originally in RegionServers to storage nodes as the in-storage processing unit. The proposed system architecture, the read logic, the in-storage compaction design, and data consistency/correctness are also discussed in this section. A self-adaptive caching scheme is described in Section 6, and we present the evaluation results in Section 7. Finally, we conclude the article and discuss future work in Section 8.


2 BACKGROUND

In this section, we first introduce the architecture, basic operations, and data management of HBase. Then, we introduce the compute-storage disaggregated infrastructure.

2.1 HBase Preliminary

HBase [4], a distributed column-oriented KV database, is an open-source implementation of Google BigTable [15]. The architecture of a typical HBase is shown in Figure 1. It consists of three components: (1) HBase RegionServers receive and respond to queries from clients; (2) an HBase Master decides the key-range assignment, manages tables, and handles high-level services; and (3) ZooKeeper monitors and maintains the states of the cluster. Since HBase is a KV-based NoSQL database, it provides simple interfaces for data read/write queries, including Get, Scan, Put, and Delete. Get is a point-lookup query that requests the KV-pair with a given key. Scan returns a set of KV-pairs that satisfy the scan requirement (i.e., start from a certain key, scan a range of KV-pairs, and filter out some KV-pairs). Put inserts new KV-pairs, and Delete removes existing KV-pairs from HBase.
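
For reference, these four query types map directly onto the standard HBase client API. The following minimal Java sketch shows how a client issues Put, Get, Scan, and Delete; the table name "usertable" and column family "cf" are illustrative, and the Scan call uses the HBase 1.x style API.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQueryExample {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster; table and column family names are illustrative.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("usertable"))) {

            // Put: insert a new KV-pair.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
            table.put(put);

            // Get: point lookup of a single row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"));

            // Scan: iterate over a key range starting from a given start-key.
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("row1"));   // HBase 1.x API; withStartRow() in 2.x
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    // process each returned row ...
                }
            }

            // Delete: remove an existing KV-pair (internally written as a tombstone).
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}
```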

Fig. 1.

Fig. 1. The legacy architecture of HBase.

HBase applies an LSM-tree [28] indexing scheme to persistently store KV-pairs in a backend file system and achieve high write performance. KV-pairs are sorted and stored in files called HFiles. Different from LevelDB or RocksDB, which are also LSM-tree-based KV-stores, HBase does not have the “level” concept: all HFiles are maintained at the same level. Therefore, the key range of one HFile may overlap with the key ranges of other HFiles. In a RegionServer, new KV-pairs, or the special KV-pairs indicating deletions, are first temporarily cached in a memory buffer called the MemStore. At the same time, to ensure data persistency, these KV-pairs are appended to a WAL in storage. One RegionServer maintains several HRegions. One HRegion is responsible for a key range and manages one MemStore. When a MemStore is full, its KV-pairs are sorted in a pre-defined order and written to storage as a new HFile. When a client wants to read a KV-pair (e.g., via a Get query), the MemStore and all HFiles that may contain the key of this KV-pair are searched. Since deleting and updating a KV-pair are achieved by inserting a new KV-pair with the same key and a delete mark or a new value, respectively, there may be multiple KV-pairs with the same key stored in the MemStore and HFiles. Therefore, to respond to a Get request, HBase needs to find all these KV-pairs, and a RegionServer combines the results before responding to a client. Only the KV-pairs satisfying the read requirement (e.g., the latest KV-pairs) are returned to the client.

As more and more new KV-pairs are inserted, the deleted KV-pairs and invalid KV-pairs (e.g., KV-pairs that have been updated) should be cleaned out to save space. Also, as the number of HFiles increases, more HFiles must be searched to respond to Get or Scan queries, which hurts read performance. Therefore, an operation called compaction is introduced to combine several HFiles into a single larger HFile. When a compaction is triggered in one RegionServer, some HFiles are selected for compaction based on certain pre-defined criteria. Then, the RegionServer sequentially reads the KV-pairs from these HFiles, compares their keys, and removes the invalid KV-pairs. All the valid KV-pairs are sorted and written back as a new large HFile.
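
Conceptually, a compaction is a multi-way merge over sorted HFiles in which only the newest version of each key survives and deleted keys are dropped. The following simplified, self-contained Java sketch illustrates this merge; in-memory lists stand in for HFiles, and the names are illustrative rather than HBase internals (a real compaction streams through the files with a heap instead of sorting everything in memory).

```java
import java.util.*;

public class CompactionSketch {
    // Simplified KV entry: seq is larger for newer writes; tombstone marks a Delete.
    record KV(String key, long seq, String value, boolean tombstone) {}

    // Merge several sorted inputs (stand-ins for HFiles) into one sorted output,
    // keeping only the newest version of each key and dropping deleted keys.
    static List<KV> compact(List<List<KV>> hfiles) {
        List<KV> merged = new ArrayList<>();
        hfiles.forEach(merged::addAll);
        // Sort by key, newest version first on equal keys.
        merged.sort(Comparator.comparing(KV::key)
                              .thenComparing(Comparator.comparingLong(KV::seq).reversed()));

        List<KV> output = new ArrayList<>();
        String lastKey = null;
        for (KV kv : merged) {
            if (kv.key().equals(lastKey)) continue;   // older version of an already-emitted key
            lastKey = kv.key();
            if (!kv.tombstone()) output.add(kv);      // drop deleted keys
        }
        return output;                                 // written back as one new, larger HFile
    }
}
```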

An HFile is organized as a tree structure similar to a B+ tree. KV-pairs are stored in data blocks (e.g., 64 KB per data block) in sorted order. Data blocks are appended from the beginning of an HFile. Then, other metadata blocks, including indexing blocks, filter blocks, and other blocks containing metadata information, are stored after the data blocks. An indexing block maintains the key-ranges, sizes, and offsets of the data blocks. If a single level of indexing is not enough to index the data blocks, a second- or third-level indexing block is created automatically. A RegionServer maintains a manifest that maps HFiles to their key-ranges.
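
The role of the indexing block can be illustrated with a small sketch: given the sorted (first-key, offset, size) entries of the data blocks, locating the block that may contain a key is a binary search over first keys. The structure below is a simplification (single-level index, string keys) rather than the actual HFile format.

```java
import java.util.List;

public class HFileIndexSketch {
    // One index entry per data block: the first key it holds, plus its location in the HFile.
    record IndexEntry(String firstKey, long offset, int size) {}

    // Return the position of the data block whose key-range may cover 'key',
    // i.e., the last block whose first key is <= key (or -1 if key precedes all blocks).
    static int locateBlock(List<IndexEntry> index, String key) {
        int lo = 0, hi = index.size() - 1, ans = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).firstKey().compareTo(key) <= 0) {
                ans = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return ans;   // the block at 'ans' is then read and binary-searched for the KV-pair
    }
}
```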

To look up a key for a Get query or to find the start KV-pair of a Scan query, the query is first sent to the RegionServer that maintains the key-range containing that key. Then, based on the key-ranges of the HFiles, the HFiles whose key-ranges cover the key or overlap with the requested key range are selected. The RegionServer starts an HFile scanner for each selected HFile to search for KV-pairs with the given key, and each HFile scanner returns the requested KV-pairs or Null if none exists.

An HFile scanner first searches an indexing block and locates the data blocks whose key-ranges cover the key. Then, the HFile scanner reads those data blocks into a block cache in the RegionServer, and the RegionServer applies a binary search to the KV-pairs in the data blocks to check for the existence of the requested KV-pair. Since there may be multiple KV-pairs with the same key stored in different HFiles, the RegionServer builds a heap to sort the KV-pairs returned from all the HFile scanners and picks the top/latest one to return to the client.

For Scan queries, a client specifies the start position of the scan by providing a start-key. The process of finding the start-key in each HFile is the same as that of Get. To continuously get the consecutive KV-pairs after the start-key, a client calls Next() once the start-key (i.e., an existing key equal to or greater than the start-key) is identified. When Next() is called, the top KV-pair is popped out from the heap of the scanner in the RegionServer and returned to the client. Then, the HFile scanner whose KV-pair was popped out advances to its next KV-pair, if it exists, and inserts the new KV-pair into the heap. A new top KV-pair is then produced by the heap. One example is shown in Figure 2.
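
The read heap just described can be sketched as a priority queue over per-HFile scanners: each scanner contributes its current KV-pair, the smallest key (newest version first on ties) sits on top, and each Next() pops the top and refills the heap from the same scanner. The sketch below is illustrative and self-contained (plain iterators stand in for HFile scanners); it is not HBase's actual internal classes.

```java
import java.util.*;

public class ReadHeapSketch {
    record KV(String key, long seq, String value) {}

    // One entry per HFile scanner: its current KV-pair and the iterator to advance it.
    record HeapEntry(KV kv, Iterator<KV> scanner) {}

    private final PriorityQueue<HeapEntry> heap = new PriorityQueue<>(
            Comparator.comparing((HeapEntry e) -> e.kv().key())
                      .thenComparing(Comparator.comparingLong((HeapEntry e) -> e.kv().seq()).reversed()));

    ReadHeapSketch(List<Iterator<KV>> scanners) {
        for (Iterator<KV> s : scanners) {
            if (s.hasNext()) heap.add(new HeapEntry(s.next(), s));   // seed with each scanner's first KV-pair
        }
    }

    // Next(): pop the top KV-pair, advance the scanner it came from, and re-insert.
    KV next() {
        HeapEntry top = heap.poll();
        if (top == null) return null;                               // all scanners exhausted
        if (top.scanner().hasNext()) heap.add(new HeapEntry(top.scanner().next(), top.scanner()));
        return top.kv();
    }
}
```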

Fig. 2.

Fig. 2. One example of the read logic in legacy HBase. When a client issues a Get or Scan request with a given key, the RegionServer first determines which HFiles may be related to the current query (e.g., HFiles 1, 3, 6, 7, 12, and 19). Then, the RegionServer starts an HFile scanner for each HFile involved. An HFile scanner reads the data blocks from HFiles into the block cache and applies a binary search to locate the first key that satisfies the read requirement. Then, the relevant KV-pair is inserted into a read heap. The read heap returns the top KV-pair to the client.

Each RegionServer maintains a read cache (called the block cache) for the accessed blocks, including both data blocks and metadata blocks read out from HFiles. Each cached block is indexed by its block sequence number, which is a unique ID for each block created in HBase. For read queries and compactions, metadata blocks are first read into the block cache (if there are cache misses) and then searched by RegionServers. Based on the outcomes of the metadata block searches, the corresponding data blocks are fetched into the cache. Blocks in the block cache are evicted based on an LRU policy. Since metadata blocks are accessed much more frequently when responding to read queries, they are usually cached in the block cache for a longer period of time. During a compaction, some existing HFiles are compacted into a new HFile and the old HFiles are deleted. The blocks from those deleted HFiles will never be used again, and they are gradually evicted from the cache.
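
A minimal version of such a block cache can be expressed with an access-ordered LinkedHashMap keyed by the block sequence number. The sketch below only captures the LRU behavior described above; it ignores the preferential treatment of metadata blocks and uses an entry-count budget instead of a byte budget, and it is not HBase's actual LruBlockCache implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruBlockCacheSketch {
    private final int capacity;   // maximum number of cached blocks (a size-in-bytes budget in practice)

    // Access-ordered map: iteration order is least-recently-used first.
    private final LinkedHashMap<Long, byte[]> cache;

    LruBlockCacheSketch(int capacity) {
        this.capacity = capacity;
        this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > LruBlockCacheSketch.this.capacity;   // evict the LRU block
            }
        };
    }

    // Blocks are indexed by their unique block sequence number.
    byte[] get(long blockSeqNo)              { return cache.get(blockSeqNo); }
    void   put(long blockSeqNo, byte[] data) { cache.put(blockSeqNo, data);  }
}
```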

HBase itself is not responsible for its data persistency. In the legacy HBase deployment, HFiles are stored in HDFS. A RegionServer and an HDFS DataNode can be deployed on the same host. If the HFiles managed by one RegionServer are also stored at a DataNode on the same host/server, the RegionServer does not need to read data from other hosts/servers. However, if HFiles are not stored on the host where the RegionServer is deployed, reading and writing data blocks creates network traffic. Due to the LSM-tree-based design of HBase, this may create serious read and write amplification. One data block usually consists of hundreds of KV-pairs. When one KV-pair is requested, the whole data block is sent to the RegionServer through the network, which creates read amplification. Moreover, since one key may exist in multiple HFiles (e.g., a KV-pair has been updated multiple times and stored in different HFiles), one Get query may trigger multiple data block reads, which makes the read amplification even worse. The size mismatch between the requested KV-pair and the data blocks being read out not only consumes the available network bandwidth but also causes performance degradation, especially when the available network bandwidth is low.

2.2 Compute-Storage Disaggregated Infrastructure

Over the past 10 years, IT infrastructures have trended toward disaggregating the compute cluster from the storage cluster to improve flexibility and scalability. Applications and services run on compute nodes, or they are managed in virtual machines (VMs) and containers to achieve better resource management. In such a setup, storage devices can no longer be tightly coupled with one compute node. Also, applications have various demands for different data I/O interfaces, including block, file system, blob, object, and KV. Therefore, organizing the storage clusters into one or more shared storage nodes/pools and providing different I/O interfaces further enhances the availability of the storage services. As a result, compute-storage disaggregation becomes an important architecture for large-scale IT infrastructures.

As shown in Figure 3, in a compute-storage disaggregated infrastructure, the compute cluster focuses on computing-intensive applications, while the storage cluster provides data persistence capabilities to the applications and supports various I/O interfaces. A high-performance network connects the storage and compute clusters. In today’s data centers, the network bandwidth between two servers can be 10 Gbps or even higher. However, the disaggregation of storage and computation may lead to a performance penalty for data-intensive applications due to the extremely heavy network traffic created by tens or hundreds of different applications. Also, one storage node in the storage cluster serves a large number of requests at the same time. The network bandwidth dedicated to one application can be much smaller when many read/write requests are issued intensively by a large number of applications simultaneously. This causes serious performance issues, especially for applications with high I/O demands.

Fig. 3.

Fig. 3. Compute-storage disaggregated infrastructure.

HBase is a typical application with high I/O demands, and it can easily encounter this performance penalty in a compute-storage disaggregated infrastructure. RegionServers are deployed on compute nodes, and all HFiles are stored on storage nodes. For read requests, once block cache misses happen, data blocks are fetched from storage nodes through the network to RegionServers. If client requests spike, a huge number of data blocks has to be read out from storage nodes during a short period of time. If the available network bandwidth of the compute node is insufficient, read performance will be extremely low (i.e., high latency and low QPS). Second, a compaction also creates a large amount of read and write traffic through the network. If a compaction is slowed down due to low network bandwidth, both read and write performance will be heavily impacted because of the limited available network bandwidth for block reads and writes. The details are explored in Section 4.


3 RELATED WORK

In this section, we first review studies on active storage devices. Then, we introduce related work on near-data processing.

3.1 Active Storage Devices

For the past 60 years, persistent storage devices have mainly been used in a passive way. That is, the host decides where and when to read and write data on a storage device. Data is read from a storage device into memory and processed there. When necessary (e.g., the data in memory is new or updated), data is written back to storage to ensure persistency. In most cases, a block interface is used. However, due to the lower I/O throughput of storage devices compared with memory, storage I/O becomes the performance bottleneck for I/O-intensive applications.

Storage devices produced in recent years, including HDDs and SSDs, are equipped with additional CPU cores and DRAM to support management services inside the device. This makes pre-processing data in storage devices feasible. The concept of the active disk was introduced early in [32, 33, 34]. The design of the active disk demonstrated the possibility and benefits of using the processing power of storage devices. Moving the list intersection operation (i.e., the operation that finds the common elements between two lists) from memory to the SSD is one example of avoiding the heavy I/O between memory and the SSD [40]. Originally, no matter how small the intersection is, both lists must be read out from the SSD into memory to identify the common elements. If the lists are extremely large, this causes a large number of SSD reads and performance will be low. By benefiting from the relatively high search speed inside the SSD, processing the list intersection inside the SSD can improve throughput and reduce energy consumption. Similarly, more complicated search functions can be implemented in an SSD to reduce I/O accesses [41].

Applications like Hadoop and data mining are also I/O intensive. The benefits and tradeoffs of moving some Hadoop tasks to SSDs are introduced in [24] and [29]. Interfaces for defining a job to be executed in an SSD are provided, and they can also be used in a large-scale distributed environment. In-storage processing-based SSDs are also used to accelerate data mining jobs [9]. The global merging operations are executed inside the SSD, which reduces the required number of data reads and writes.

Databases are widely used I/O-intensive applications, and several research studies use active devices to speed up their performance [1, 2, 18, 23]. KVSSD [2] and Kinetic drives [1] are two examples that implement a NoSQL database (KV store) inside a storage device. KVSSD integrates the KV store function with the SSD firmware and provides object-store-like interfaces to applications. KVSSD provides high I/O efficiency and cost-effectiveness. Similar to KVSSD, a Kinetic drive [1, 13, 27] runs a mini operating system and a KV store in an HDD. Clients no longer need to read and write data in fixed-size blocks on a Kinetic drive. It supports the KV interface, including Get, Put, Delete, and Scan. Moreover, clients can access the disk via Ethernet connections, which makes the usage of storage devices more flexible. Active devices are also used to speed up SQL databases. By executing the Select query in the SSD, both performance and energy consumption are improved [18]. Moreover, YourSQL [23] offloads the execution of a Scan query to the SSD. It improves throughput by 3.6X to 15X by reducing the data read out from the SSD to memory.

In general, an active storage device trades off reduced I/O overhead against increased processing time due to the limited CPU and memory resources inside the device. By carefully selecting an application and the functions to be offloaded to the device, overall performance and energy savings can be effectively improved.

3.2 Near-Data Processing

To bridge the performance gaps between memory and storage and between CPU and memory, the concept of near-data processing was proposed. Active storage devices and Processing-In-Memory (PIM) are two typical approaches to near-data processing. Co-KV [36] and DStore [37] propose to execute compactions for LSM-tree-based KV stores in the storage device. In an LSM-tree-based KV store (e.g., LevelDB or RocksDB), the files flushed out from memory need to be compacted so that deleted and updated KV-pairs can be removed. Compactions benefit both space usage and read performance. However, a large amount of data needs to be read out and written back during a compaction, which brings high overhead. Co-KV and DStore offload the compaction operation from the host to the storage device (e.g., an SSD) so that the compaction traffic between the host and its storage devices is avoided. A compaction can be executed inside the storage devices as a background job.

Near-data processing is also used to speed up the processing of big data workloads. Biscuit [20] proposes a framework to apply near-data processing in SSDs for big data workloads such as word count, string searching, DB scanning and filtering, and pointer chasing in graph processing. Biscuit provides a library for clients to program and deploy their data processing modules on the host and in SSDs so that some of the I/O-intensive jobs can be executed in SSDs. In this way, it effectively reduces the traffic between storage devices and the host to improve overall throughput.


4 PERFORMANCE DEGRADATION OF HBASE IN COMPUTE-STORAGE DISAGGREGATED INFRASTRUCTURES

To comprehensively investigate the performance issues of HBase when it is deployed in a compute-storage disaggregated infrastructure, we conduct a series of experiments to explore the relationship between network conditions and HBase performance. Note that, in these experiments, we do not run other applications at the same time. In production, there can be 50–100 I/O-intensive application instances running on the same host. With a 10 Gbps network connection, the average bandwidth dedicated to one instance is about 100–200 Mbps. Therefore, to simulate the network bandwidth that can be used by HBase alone, we vary the network bandwidth between 2 and 800 Mbps.

We use the following metrics: (1) Network amplification (NET-AM): the amount of data transmitted from storage nodes through the network when responding to a client request divided by the amount of data actually needed by the client; and (2) QPS. When reading/writing the same number of KV-pairs, a higher NET-AM indicates that more data is transferred through the network, and thus the performance is more sensitive to the available network bandwidth. QPS measures the overall performance of HBase from the client's perspective. We use Open vSwitch [30] to control the available network bandwidth between compute servers (where RegionServers are deployed) and storage servers (where HDFS DataNodes are deployed). We focus on evaluating the NET-AM of Put, Get, and Scan queries and the correlation of QPS with the available network bandwidth. Compaction is triggered during Put queries, and its influence is thus included.
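
Stated as a formula, let \( B_{net} \) denote the bytes transferred from storage nodes through the network while serving a set of requests and \( B_{req} \) the bytes actually needed by the client; then \( \text{NET-AM} = B_{net} / B_{req} \). For example, a Get for a 1 KB KV-pair that misses the cache and pulls one 64 KB data block across the network incurs a NET-AM of about 64 for that request.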

4.1 Experimental Setup

Figure 4 depicts the deployment of HBase in our test environment. In the compute-storage disaggregated setup, two servers are deployed with HBase RegionServers (i.e., compute servers) and two others are deployed as HDFS DataNodes (i.e., storage servers). An HDFS NameNode is deployed on one of the storage servers. Compute servers and storage servers are connected via Open vSwitch [30] such that we can dynamically adjust the available network bandwidth to simulate different levels of network congestion. The HDFS version is 2.7.3 and the HBase version is 1.2.6. Figure 5 shows the legacy deployment of HBase. We adopt the pseudo-distributed mode of HBase so that we can collect the network traffic more precisely. We keep one replica of HFiles, and the data block size of HFiles is configured as 64 KB.

Fig. 4.

Fig. 4. The HBase deployment in a compute-storage disaggregated setup.

Fig. 5.

Fig. 5. The legacy deployment of HBase in our experiment.

We use YCSB [16] workload A to issue Get, Put, and Scan queries to HBase. We use two different hotness distributions: (1) Zipfian and (2) Uniform. The Zipfian workload issues more queries to some of the “hot” KV-pairs, with hotness following a Zipfian distribution, so it has better key-space locality. Under Zipfian workloads, the block cache has a higher cache hit probability, which can effectively reduce the number of block reads for Get and Scan queries. In the Uniform workload, the keys of queries are randomly generated, so there are no hot KV-pairs. We set the size of each KV-pair to 1 KB, so one data block holds about 64 KV-pairs. Since YCSB does not support Scan queries with filters, we developed a new benchmark tool that can generate Scan queries with column filters. The network status, including the available bandwidth and traffic amount, is collected with the Linux iptables tool [6].

4.2 NET-AM Evaluation

In this subsection, we evaluate the NET-AM of Put, Get, Scan, and Scan with filter.

Put: In this experiment, YCSB writes KV-pairs into the same HBase table and the same column family. We write different numbers of KV-pairs to HBase to measure the corresponding NET-AMs. During the write process, HFiles are compacted, and thus data blocks are read out from these HFiles. After compaction, valid KV-pairs are written back to storage as a new HFile. These compactions make the NET-AM higher. As shown in Figure 6, the X-axis is the total number of records written to HBase, and the Y-axis is the measured NET-AM of writes (data accessed by the storage servers divided by data sent to the RegionServers). HBase compactions cause high NET-AMs. For example, if we insert 10 million KV-pairs (about 10 GB of data in total), the total data accessed by the storage server is about 18X (about 180 GB in total) after HBase is stabilized (i.e., compactions no longer happen). During a compaction, data has to be read from several HFiles on HDFS servers and transferred to the RegionServer through the network. After the compaction, a new HFile has to be written back to the HDFS server. The same data is read and written several times, which leads to a high NET-AM for writes.

Fig. 6.

Fig. 6. NET-AM of write queries.

As the total number of KV-pairs increases, the NET-AM only increases slightly. For example, the NET-AMs of inserting 10 million KV-pairs and 100 million KV-pairs are 18 and 18.7, respectively. As more KV-pairs are inserted, more HFiles need to be combined during a compaction. If deletion and update queries are also issued by clients, the NET-AM can be a bit lower because some KV-pairs are cleaned during compactions. In general, due to the LSM-tree-based design (compaction + WAL), HBase always has a write amplification issue, which creates a higher NET-AM.

Get: To evaluate the NET-AM and QPS of Get queries, we first insert 100 million KV-pairs into the HBase table as the base dataset via YCSB LOAD operations. Then, we use YCSB to generate Get queries with a Uniform distribution (Read-Uniform) or a Zipfian distribution (Read-Zipfian). The former randomly accesses the 100 million KV-pairs, while the latter accesses hot KV-pairs more frequently, following a Zipfian distribution. At the same time, we vary the ratio of KV-pairs being read among the 100 million KV-pairs. As shown in Figure 7, the X-axis represents the ratio of requested data to the total dataset, and the Y-axis represents the NET-AM of reads (i.e., the size of data read from a storage server to the RegionServer divided by the actual requested data size). When the ratio of requested data is less than 0.1%, the NET-AM is about 90 in Read-Uniform, which is about 30% higher than when the requested data ratio is higher than 1%. Similarly, for Read-Zipfian, when the ratio of requested data is less than 0.1%, the NET-AM is about 80, which is about 85% higher than when the requested data ratio is higher than 1%.

Fig. 7.

Fig. 7. NET-AM of get queries with Uniform and Zipfian KV-pair access hotness distributions.

The evaluation results verify that when the requested dataset is extremely small (i.e., less than 0.1% of KV-pairs are accessed), read amplification is more serious. The amount of metadata (e.g., HFile metadata and metadata blocks) being read out is relatively constant, because once that data is read to the RegionServer, it is most likely to be cached. When the requested data ratio is larger, the metadata overhead is amortized and thus the NET-AM decreases. Note that the NET-AM of read-intensive workloads with a Uniform distribution is always higher than that with a Zipfian distribution. The difference is caused by the variations in data locality and cache efficiency. If the requested data locality is very low (e.g., KV-pairs are fetched randomly across the entire key range), the read amplification can theoretically exceed 64 when the data block cache hit ratio is very low. That is, when a client reads one 1 KB KV-pair, one 64 KB data block is sent to the RegionServer, and the data block is rarely used again before it is evicted. If the KV-pair size is smaller, the read amplification can be even worse and creates an even higher NET-AM.

A heavy network traffic load caused by the high NET-AM makes Get performance sensitive to the available network bandwidth. As shown in Figure 8, when the network bandwidth between a RegionServer and a storage server decreases, the QPS of Get decreases. We normalize all the QPS results to the QPS when the network bandwidth is 2 Mbps. If the available network bandwidth is higher than 100 Mbps, the QPS of Read-Zipfian is noticeably higher than that of Read-Uniform. With better read locality, Read-Zipfian has a higher cache hit ratio and thus better performance. However, as the bandwidth decreases below 64 Mbps, the QPS of the two workloads becomes nearly the same and drops drastically. Under such network conditions, the data read from the storage server saturates the available network bandwidth and the read performance decreases dramatically. Therefore, in a compute-storage disaggregated infrastructure, the read performance of HBase can be easily impacted by the available network bandwidth due to its extremely high I/O demand.

Fig. 8.

Fig. 8. Normalized QPS of get queries based on the QPS when the network bandwidth is 2 Mbps. KV-pairs are accessed with Uniform and Zipfian KV-pair access hotness distributions.

Scan: In this experiment, we first insert 100 million KV-pairs into an HBase table as the base dataset via YCSB LOAD operations. We use YCSB workload B to generate only Scan requests (the ratios of Get and Put are set to 0). Each Scan query requests \( N \) consecutive records, where \( N \) is the scan length; \( N \) is configured as 10, 100, and 1,000 in different tests. For example, scan-10 refers to the test in which YCSB randomly selects a key as the start-key of a Scan iterator, and HBase then calls Next() 10 times to get the next 10 consecutive KV-pairs. We vary the total number of KV-pairs read by Scan from 0.01% to 40% of the KV-pairs stored in HBase.
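
The scan pattern used in these tests corresponds to the following client-side loop: pick a start-key, open a scanner, and call next() \( N \) times (scan-10 corresponds to n = 10). This is a hedged sketch against the standard HBase 1.x client API; the Table handle and key encoding are assumed to be set up as in the earlier client example.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class ScanLengthExample {
    // Scan n consecutive KV-pairs starting from startKey (e.g., n = 10, 100, or 1,000).
    static int scanN(Table table, byte[] startKey, int n) throws IOException {
        Scan scan = new Scan();
        scan.setStartRow(startKey);            // HBase 1.x API; withStartRow() in 2.x
        int returned = 0;
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (int i = 0; i < n; i++) {
                Result r = scanner.next();     // scan-10 issues exactly 10 of these calls
                if (r == null) break;          // reached the end of the key space
                returned++;
            }
        }
        return returned;
    }
}
```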

As shown in Figure 9, the X-axis is the requested data ratio, which varies from 0.01% to 40%, and the Y-axis is the NET-AM of Scan (the total data read through the network divided by the data returned to clients). When the scan length is smaller, the NET-AM is higher. The NET-AM of Scan-10 is about 4X and 30X higher than that of Scan-100 and Scan-1,000, respectively. The shorter a scan length is, the more KV-pairs in the data blocks read out for Scan queries are left unused. For example, if the scan length is 10 and all 10 KV-pairs are in the same data block, the remaining 54 KV-pairs are not used by the current Scan, which causes high read amplification. Therefore, the lower the ratio of requested data in a data block, the higher the network traffic amplification. Similar to Get, as the requested data ratio increases from 0.01% to 40%, the NET-AM of Scan decreases due to a higher cache hit ratio. In general, with a smaller scan length and a lower requested data ratio, the NET-AM is higher.

Fig. 9.

Fig. 9. NET-AM of Scan operation.

Scan with filter: In some HBase use cases, a client may want to exclude some of the KV-pairs while scanning. A client can set the filter parameter of Scan to specify the filtering condition. For a KV-pair, its key is composed of a raw key, column family, column, and timestamp. A client can use the Scan filter to effectively skip some of the KV-pairs and get only the KV-pairs needed. For example, suppose an HBase table has 100 different columns and a client wants to scan the KV-pairs from key A to key B. HBase returns all the KV-pairs whose keys fall in the key range [A, B]. If the client is only interested in the KV-pairs from a certain column, for example, Column 2, the client can set the column name as the filter condition. In this case, only the KV-pairs whose keys fall in the key range [A, B] and that belong to Column 2 are returned. When using Scan with filter, the requested data ratio is even smaller. However, all the data blocks in the key range [A, B] still need to be transferred from storage servers to the RegionServer, so the NET-AM can be worse than that of a simple Scan.
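
In the HBase client API, the [A, B] / Column 2 example above corresponds to restricting the Scan to one column (a filter such as QualifierFilter would have the same effect). This is a hedged sketch using the HBase 1.x API; the table handle, family name "cf", and column name "col2" are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanWithColumnFilterExample {
    // Scan keys in [A, B] and keep only the KV-pairs from column "col2" of family "cf".
    static ResultScanner scanColumn2(Table table) throws IOException {
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("A"));   // key range [A, B]; HBase 1.x API
        scan.setStopRow(Bytes.toBytes("B"));
        // Restrict the scan to one column. In legacy HBase this filtering still
        // happens at the RegionServer, so whole data blocks cross the network.
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col2"));
        return table.getScanner(scan);          // caller iterates and closes the scanner
    }
}
```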

Since YCSB cannot directly generate Scan queries with filters, we developed a new benchmark to issue Scan requests with simple filter conditions. First, 200,000 records with 100 columns each are written into HBase, so about 20 GB of data is stored in HBase. The column filter ratio varies from 0% to 95% in each test with the same 50 MB of requested data. For example, with a 90% filter ratio, at least 500 MB of data is read out from HDFS while only 50 MB of data is requested (i.e., 90% of the KV-pairs are filtered out before returning to the client). Figure 10(a) shows that the higher the filter ratio, the higher the network amplification. The reason is that all column data in the same data blocks has to be read out to the RegionServer regardless of the filter ratio. The filtering is done on the RegionServer rather than on the HDFS servers.

Fig. 10.

Fig. 10. The NET-AM and QPS results of Scan with column filter.

Figure 10(b) shows the normalized QPS (relative to the QPS when the network bandwidth is 8 Mbps) of experiments with different column filter ratios and different available network bandwidths. When the network condition is the same (e.g., the network bandwidth is 256 Mbps), Scan with a higher filter ratio achieves higher QPS since less data is transmitted between compute nodes and storage nodes. When the network becomes congested (e.g., the network bandwidth drops from 512 Mbps to 8 Mbps), the QPS of Scan with low filter ratios drops quickly.

4.3 Observations

The physical separation of RegionServers and HDFS servers in the compute-storage disaggregated infrastructure results in heavy network traffic between them, which makes the performance of HBase very sensitive to the available network bandwidth. We make the following observations: (1) Compaction causes both high read and write amplification, which influences the overall performance; (2) Read amplification becomes more severe for Get and Scan due to the size mismatch between the client-requested KV-pairs and the data blocks being transmitted; and (3) The performance of Scan with filter is highly influenced by the network condition and the filter ratio.

These observations motivate us to redesign HBase to adapt it to the compute-storage disaggregated infrastructure. The basic idea is to move some I/O-intensive modules from the RegionServer to the storage nodes such that the heavy network traffic can be effectively reduced. More design details and the challenges being addressed are presented in Sections 5 and 6, respectively.


5 IS-HBASE DESIGN

In this section, we present a new HBase architecture that offloads the HFile scanner function, called the ISSN, from RegionServers to storage nodes. An ISSN follows the concept of in-storage computing: it processes the data blocks of HFiles in storage nodes and only returns the required KV-pairs to the RegionServer. By filtering out irrelevant data in the ISSN, the new design effectively reduces network traffic, especially for Get and Scan with filter. Moreover, the ISSN enables in-storage compaction, which avoids the performance impact caused by the heavy I/Os during a compaction.

However, several challenges need to be addressed when we deploy the new ISSN-based HBase (called IS-HBase): (1) How do we design a new protocol for the communication between RegionServers and in-storage scanners in different storage nodes? (2) How do we handle Scan and Get with the ISSN? (3) How do we achieve compactions in the new architecture? and (4) How do we ensure data consistency and correctness?

In the following subsections, we first describe the new architecture of IS-HBase. Then, the communication between RegionServers and ISSNs is discussed. Next, we demonstrate the processes of responding to Get and Scan queries as well as how compactions are processed. We also introduce the mechanisms that ensure data correctness and consistency.

5.1 System Overview

To generalize the compute-storage disaggregated infrastructure for in-storage-computing-based applications, we make the following assumptions:

There are two clusters: a compute cluster and a storage cluster. Each server in the compute cluster has a large memory space and powerful CPUs. Its local storage is fast (e.g., SSD-based) but small in capacity. These compute servers are connected with a high-speed network, and the network bandwidth of each server is the same. The storage cluster consists of a number of storage servers. Each storage server has enough memory and CPU resources to handle I/O requests. Compared with a compute server, a storage server has limited memory space and a less powerful CPU. Each storage server maintains a number of storage devices (e.g., HDDs or SSDs). The compute cluster and storage cluster are also connected via a high-speed network with a fixed available network bandwidth.

The storage cluster is deployed similarly to an object storage system, which is very popular in distributed file systems like Lustre and HDFS. The servers that provide the capability of storing data objects are the storage nodes. In addition, a small number of servers provide metadata services. Metadata servers receive the read and write requests from clients and respond with the addresses of the storage nodes that store the data (for read requests) or the storage node addresses to write the data to (for write requests). Then, the client directly contacts the storage nodes to read and write the objects. When a read/write session is finished, the storage node updates the status of the objects with the metadata servers. Applications do not need to understand the internal mechanisms of the storage system such as data replication, migration, synchronization, data integrity protection (e.g., checksum and erasure coding), and management. Applications interact with the storage system by calling the storage system APIs. The in-storage modules are likewise independent of the storage system; they use the same APIs to interact with it. The difference is that the in-storage modules may directly read and write data on the same host, or consume the network bandwidth between storage nodes if the data is stored on other hosts.
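
The interaction just described can be summarized as a narrow interface. The Java sketch below is purely illustrative: none of these method names come from the paper or from an actual storage system; it only fixes the contract that both RegionServers and ISSNs are assumed to use, with replication, migration, and integrity protection hidden behind it.

```java
// Purely illustrative interface for the disaggregated object store described above.
public interface ObjectStoreClient {
    // Ask the metadata service which storage node holds (or should hold) an object.
    String locateForRead(String objectName);
    String locateForWrite(String objectName);

    // Read/write the object directly on the resolved storage node; the caller does not
    // see data replication, synchronization, or integrity protection.
    byte[] read(String storageNode, String objectName, long offset, int length);
    void   write(String storageNode, String objectName, byte[] data);
}
```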

The compute cluster may run multiple services, and HBase is one of them. To better utilize the memory/CPU resources, the RegionServers of HBase are deployed as VMs on a number of physical servers. Therefore, each RegionServer is allocated its own memory space, network bandwidth, and CPU resources.

Storage nodes are also connected via networks, and the available bandwidth is fixed. Different in-storage computing modules are offloaded for different services, and ISSNs are one of them, running in storage nodes as backend daemons. Each ISSN has its own memory and CPU utilization limits such that the performance of the primary storage I/O requests and of other in-storage computing modules can be guaranteed.

The architecture of IS-HBase in a compute-storage disaggregated infrastructure is shown in Figure 11. In IS-HBase, to reduce network traffic, a RegionServer does not actively read data blocks from storage nodes into its memory. Instead, in most cases, only the requested KV-pairs are sent through the network to the RegionServers. To achieve this, each storage node that stores HFiles is deployed with an HBase ISSN as the local processing module. Each ISSN maintains the key-range information of the HFiles in its storage server and caches some of the metadata/data blocks. If there is no read request from any RegionServer, an ISSN uses very little CPU resource. To ensure the quality of the primary workload and storage service, we only offload light-weight function modules, such as Get, Scan with filter, and compaction, to the storage nodes.

Fig. 11.

Fig. 11. The architecture of IS-HBase deployed in a compute-storage disaggregated infrastructure. In this example, there are \( N \) compute nodes and \( M \) storage nodes. In each compute server, different applications are running, and data are exchanged via a high-speed network. The IS-HBase RegionServers are running as one application in the compute cluster. Each storage node that stores HFiles is deployed with one ISSN, and it manages the HFiles stored in that node. The metadata service of the object storage is running as one service in some of the storage nodes. RegionServer controls the ISSNs to achieve KV-pair reads and compaction.

We cannot directly deploy the RegionServer to the storage node as an ISSN for the following reasons. First, unlike an ISSN, which is only responsible for the HFiles in its storage node, a RegionServer is responsible for all reads, writes, and compactions of KV-pairs in its key-ranges, as well as splitting and merging the managed key-ranges, which is a much heavier workload. Second, unlike the in-storage compaction, which can be executed by multiple ISSNs in parallel (described in Section 5.3) and avoids the network traffic of file writes, a RegionServer needs to read and write all the HFiles for a compaction. This consumes tremendous network and CPU resources of the storage node, which would slow down all other data I/O services. Third, as discussed in Section 5.2.1, some Scan queries (e.g., those with long scan lengths and low filtering ratios) cannot benefit from pre-processing in the storage node, so the ISSN does not pre-process those queries. Finally, deploying the RegionServer in the storage node violates the design of the compute-storage disaggregated architecture. It complicates service management, orchestration, and scalability, and the bottleneck would likely move from the network to the CPU and memory resources of the storage node. Therefore, we only offload light-weight function modules, namely Get, Scan with filter, and compaction, to the ISSN in the storage node.

When a RegionServer receives a read request, it generates, based on the mapping from key-ranges to HFiles, a list of the HFile names that are relevant to this query. Then, the RegionServer queries the metadata service for the storage nodes that store these HFiles. Next, the RegionServer sets up communication with the ISSNs on the storage nodes that manage the relevant HFiles, so that the HFiles can be processed in the storage nodes by the ISSNs. Afterwards, the RegionServer receives the processed results from the ISSNs, combines them, and returns the final KV-pairs to the client.
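
The read path just described can be sketched as follows for a Get. The method and service names (MetadataService, IssnStub, pickNewest) are hypothetical placeholders for illustration, not the actual IS-HBase code; the sketch only fixes the fan-out/merge structure of the protocol.

```java
import java.util.*;
import java.util.concurrent.*;

public class RegionServerReadPath {
    // Hypothetical handles to the metadata service and to per-node ISSN RPC stubs.
    interface MetadataService { String locateHFile(String hfileName); }
    interface IssnStub        { List<byte[]> get(List<String> hfiles, byte[] key); }

    MetadataService metadata;                       // assumed to be wired up elsewhere
    Map<String, IssnStub> issnByNode;               // storage node -> ISSN RPC stub
    ExecutorService pool = Executors.newCachedThreadPool();

    // Respond to a Get: fan the request out to the ISSNs that hold relevant HFiles,
    // then merge the per-node candidates and keep the newest version of the key.
    byte[] get(List<String> relevantHFiles, byte[] key) throws Exception {
        // 1. Group the relevant HFiles by the storage node that stores them.
        Map<String, List<String>> byNode = new HashMap<>();
        for (String hfile : relevantHFiles) {
            byNode.computeIfAbsent(metadata.locateHFile(hfile), n -> new ArrayList<>()).add(hfile);
        }
        // 2. One RPC per storage node; the ISSNs search their HFiles in parallel.
        List<Future<List<byte[]>>> futures = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byNode.entrySet()) {
            String node = e.getKey();
            List<String> files = e.getValue();
            Callable<List<byte[]>> task = () -> issnByNode.get(node).get(files, key);
            futures.add(pool.submit(task));
        }
        // 3. Combine the candidates returned by all ISSNs (the newest version wins).
        List<byte[]> candidates = new ArrayList<>();
        for (Future<List<byte[]>> f : futures) candidates.addAll(f.get());
        return pickNewest(candidates);
    }

    // Placeholder: compare versions (e.g., sequence numbers/timestamps) and return the newest.
    byte[] pickNewest(List<byte[]> candidates) {
        return candidates.isEmpty() ? null : candidates.get(0);
    }
}
```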

For write queries (e.g., Put or Delete), new KV-pairs are accumulated in a MemStore at the RegionServer and logged to the WAL. The WAL files are stored as objects in the storage nodes. When the MemStore of one column family is full, its KV-pairs are sorted and written out as an immutable HFile to the storage cluster. During this process, the RegionServer generates the HFile name and queries the metadata server for a storage node that can store the HFile. The metadata server selects one storage node to store this new HFile based on a load balancing algorithm (e.g., round-robin or hashing based). Then, the RegionServer sends the data of the new HFile to the selected storage node. Next, the storage node confirms the success of the HFile write with the metadata server, and the metadata server responds to the RegionServer for this successful HFile write. Finally, the RegionServer updates the HFile information, including the HFile name, key-range, and other metadata, with the ISSN on that storage node.

As discussed above, servers in the compute cluster usually have local SSDs or even non-volatile memory (NVM) to temporarily store data. Therefore, using the local SSD as a secondary cache can further improve the overall performance of IS-HBase. For example, we can design a persistent cache, which is orders of magnitude larger than the DRAM cache, to store the KV-pairs and data blocks evicted from the DRAM cache. However, designing an efficient persistent cache is non-trivial; it involves policy design, data management, garbage collection, data migration, and SSD wear-leveling considerations. Therefore, we leave the integration of a local persistent cache with IS-HBase to future work.

In the following subsections, we introduce in detail how read (Get and Scan) queries and compactions are handled in IS-HBase. Then, we discuss the mechanisms that ensure data correctness and consistency in IS-HBase.

5.2 Scan and Get

In HBase, Get is a special case of Scan that only scans the first KV-pair matching the searched key. Therefore, we first introduce the process of Scan and then that of Get.

5.2.1 Scan.

When using the Scan API, a client specifies the start-key and a scan filter (if desired). HBase returns an Iterator handler to the client, and the client calls Next() to scan through the consecutive KV-pairs with keys equal to or larger than the start-key. The scan length is determined by the client. For short scans (e.g., fewer than 10 Next() calls before the client releases the Iterator handler), it is efficient for the RegionServer to collect the KV-pairs from ISSNs and merge the results. Sending a whole data block (assuming a data block consists of more than 10 KV-pairs) to the RegionServer is wasteful in this scenario, because a large portion of the KV-pairs in the data block are irrelevant to the current Scan query.

However, when the scan length is long enough (e.g., the scan passes through multiple data blocks), it is more efficient to send data blocks to the RegionServer and construct a two-level heap in the RegionServer to process and combine the scanning results. On one hand, the RegionServer can quickly read out the next KV-pair in memory instead of waiting for the round-trip network latency (i.e., from the RegionServer to one ISSN and back). On the other hand, if the network condition is good, the latency of sending a data block is close to that of sending a single KV-pair.

The scan filter also influences the scan performance. If most of the KV-pairs being scanned are filtered out, it is more efficient to apply the filter in the ISSN and only return the valid KV-pairs to the RegionServer. In this case, filtering is processed in parallel in all relevant ISSNs and the overall performance is better than filtering out the KV-pairs at the RegionServer. For example, suppose a client requests the KV-pairs from a specific column and there are hundreds of columns. When Scan is called, in the worst case, more than 99% of the KV-pairs are filtered out, so filtering out the KV-pairs at the ISSNs is more efficient.

Therefore, it is necessary to maintain two different Scan mechanisms: (1) legacy scan and (2) in-storage scan. IS-HBase chooses between the two scan logics based on the network condition, scan length, and filter conditions. Suppose each data block has \( N \) KV-pairs on average, the scan length is \( L \), the available network bandwidth is \( W \) (bits/s), the network latency is \( T \), and the data block size is \( D \) bytes. The latency of sending a data block is about \( T + D*8/W \), and the latency of sending a KV-pair is about \( T + D*8/(N*W) \). With a simple comparison, if the latency of sending \( L \) KV-pairs is longer than that of sending the whole block (i.e., \( L*(T + D*8/(N*W)) \gt T + D*8/W \)), reading the whole block to the RegionServer achieves better performance. We can design a more sophisticated policy based on these parameters and other conditions to decide whether it is efficient to send data blocks to the RegionServer for Scan queries (discussed in Section 6 in detail).
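
The comparison above can be written out directly. The function below is a straightforward translation of that inequality; the numbers in the example (64 KB blocks, 1 KB KV-pairs, a 1 ms round trip) are only illustrative.

```python
def prefer_block_transfer(L, N, W, T, D):
    """Return True if fetching the whole data block is expected to be faster
    than fetching the L requested KV-pairs one by one.
    L: scan length, N: KV-pairs per block, W: bandwidth (bits/s),
    T: network latency (s), D: block size (bytes)."""
    per_kv_latency = T + (D * 8) / (N * W)     # one KV-pair is ~1/N of a block
    per_block_latency = T + (D * 8) / W
    return L * per_kv_latency > per_block_latency

# Example: 64 KB blocks holding 1 KB KV-pairs (N = 64), a 10-KV scan, 1 ms RTT.
print(prefer_block_transfer(L=10, N=64, W=512e6, T=0.001, D=64 * 1024))  # True: fast link
print(prefer_block_transfer(L=10, N=64, W=4e6,   T=0.001, D=64 * 1024))  # False: slow link
```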

For in-storage scan, we also maintain a two-level heap design which is similar to that of legacy HBase. The difference is that only the Level-2 heap is maintained in the RegionServer to combine and sort the KV-pairs from different ISSNs. A RegionServer uses RPC calls to communicate with the ISSNs in different storage nodes. Based on the start-key of the Scan query and the key-range information of HFiles in different storage nodes, the RegionServer selects the storage nodes whose HFile key-ranges may contain the start-key. Then, the RegionServer starts the RPC calls and communicates with the ISSNs in those storage nodes. The start-key, scan filter, and other scan related information are sent to the relevant ISSNs.

At a storage node, when the ISSN receives the Scan RPC call from a RegionServer, it starts one HFile scanner for each HFile that satisfies one of the following conditions: (1) the start-key is within the key-range of the HFile, or (2) the start-key is smaller than the first key of the HFile. At the same time, the ISSN creates a Level-1 heap to combine the scanning results from all HFile scanners. The function of an HFile scanner is the same as that in legacy HBase: it reads out the metadata blocks from the HFile and searches the data blocks based on the start-key. If a matching KV-pair exists, the scanner returns the KV-pair to the Level-1 heap. At the same time, the HFile scanner constructs an Iterator so the consecutive KV-pairs can be returned to the Level-1 heap by calling Next(). When the top KV-pair of the Level-1 heap is sent to the Level-2 heap at the RegionServer, the ISSN calls Next() on the HFile scanner Iterator whose KV-pair was popped out and inserts the new KV-pair returned by Next() into the Level-1 heap. If the HFile scanner reaches the end of its HFile, the scan process for this HFile stops. A new HFile scanner may be created if another HFile covers the consecutive key-range.
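
The two-level heap can be illustrated with a small merge-based sketch, where heapq.merge plays the role of both the Level-1 heap (per ISSN, over its local HFile scanners) and the Level-2 heap (at the RegionServer, over the per-ISSN streams). Plain sorted lists stand in for real HFile scanners here; this is an assumption for illustration, not the production data path.

```python
import heapq

def level1_scan(hfile_scanners, start_key):
    """ISSN side: merge the local HFile scanners into one ordered KV stream."""
    streams = [((k, v) for k, v in scanner if k >= start_key)
               for scanner in hfile_scanners]
    return heapq.merge(*streams)             # next() pops the smallest remaining key

def level2_scan(issn_streams):
    """RegionServer side: merge the ordered streams returned by the ISSNs."""
    return heapq.merge(*issn_streams)

# Example with two storage nodes, each holding one or two (already sorted) HFiles.
node1 = level1_scan([[("a", 1), ("d", 4)], [("b", 2)]], start_key="a")
node2 = level1_scan([[("c", 3), ("e", 5)]], start_key="a")
print(list(level2_scan([node1, node2])))     # [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
```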

5.2.2 Get.

Get is a special case of Scan: return the first KV-pair if its key matches the start-key, or return Null otherwise. Therefore, IS-HBase reuses the processing logic of Scan. For a Get query, an ISSN checks the top KV-pair in its Level-1 heap. If the key is the same as the start-key, the ISSN returns the top KV-pair to the RegionServer and then cleans up the HFile scanners and the Level-1 heap. Otherwise, the ISSN returns Null to the RegionServer. After the RegionServer receives all the results from the relevant ISSNs, it compares the timestamps of all valid KV-pairs in the Level-2 heap and returns the latest KV-pair to the client. If all ISSNs return Null, the RegionServer sends Null to the client as the response of the Get query.

Figure 12 is one example showing how Get and Scan work in IS-HBase. Assume RegionServer 1 receives a Get query. Since the cache does not have the requested KV-pair, RegionServer 1 issues the request to the ISSNs that manage the HFiles containing the key of this Get. Then, each relevant ISSN constructs a number of HFile scanners in the storage node and searches for the key. A KV-pair that matches the key is sent back to RegionServer 1 and combined in its Level-2 heap. Finally, the KV-pair at the heap top is returned to the client. In the second case, RegionServer 1 finds the KV-pair in the cache, so it directly returns the KV-pair to the client, and no communication happens between RegionServer 1 and any ISSN. In the third case, a client issues a Scan query to RegionServer K. RegionServer K locates the HFiles that may contain KV-pairs whose keys are equal to or larger than the Scan start-key. Then, it issues the requests to the relevant ISSNs that manage those HFiles. Each of those ISSNs constructs a local scanner for each HFile and maintains a Level-1 heap over those scanners. At the same time, RegionServer K constructs a Level-2 heap at the compute server and sorts the KV-pairs from different ISSNs in the heap. By calling Next(), the ISSN can continuously return the ordered sequence of KV-pairs whose keys are equal to or larger than the start-key.

Fig. 12.

Fig. 12. One example that shows how IS-HBase supports Get and Scan.

5.3 In-Storage Compaction

In legacy HBase, a RegionServer selects a set of HFiles and combines them into one larger HFile. The selected RegionServer relies on its Scan module to read out the KV-pairs from the HFiles in order, cleans the invalid (deleted or updated) KV-pairs, and finally sequentially writes the remaining valid KV-pairs to a new HFile. If we did not offload the compaction process in IS-HBase, a compaction would create heavy network traffic to read and write HFiles and might cause performance degradation.

In IS-HBase, we design an in-storage compaction mechanism, which avoids this heavy traffic by performing the compaction in one of the storage nodes with the help of the relevant ISSNs. However, several challenges need to be addressed: (1) Since the HFiles of one RegionServer can be stored in different storage nodes, a new communication mechanism between RegionServers and ISSNs is needed; (2) How can compaction be achieved without reading data to RegionServers? (3) For load balancing and management purposes, the storage system has its own policy to allocate new HFiles. How do we decide where the new HFile should be written?

We propose the following design for in-storage compaction to address the aforementioned challenges. First, when a RegionServer decides to compact several HFiles based on the compaction selection policy, it selects a number of HFiles as the candidates (called candidate HFiles). Then, the RegionServer queries the storage system metadata service to get the addresses of the storage nodes of the candidate HFiles. Next, the RegionServer creates an empty HFile and writes it to the storage system. The file allocation and storing details are managed by the underlying storage system, and the metadata server returns the address of the newly created empty HFile to the RegionServer. At this moment, the RegionServer starts an RPC call to the ISSN (called the compaction-ISSN during the compaction process) of the storage node where the new empty HFile is stored. The compaction-ISSN receives the address information of the candidate HFiles and the permission to contact all relevant ISSNs in the storage nodes that store those candidate HFiles.

After the preparation is finished, the compaction-ISSN starts RPC calls to all the relevant ISSNs (called scan-ISSNs) that manage the candidate HFiles. The compaction-ISSN constructs a Level-2 heap to aggregate the results from the scan-ISSNs, and each scan-ISSN constructs a Level-1 heap over the candidate HFiles on its storage server. The compaction-ISSN continuously picks the KV-pairs from the Level-1 heaps of all relevant scan-ISSNs, filters out the invalid KV-pairs, and appends the valid KV-pairs to the pre-created new HFile. After the compaction is finished, the compaction-ISSN updates the metadata information of the new HFile in the metadata server and then informs the RegionServer of the completion of the compaction. The RegionServer can now delete all the candidate HFiles from the storage system. To ensure data correctness, if any connection is lost or any read/write fails during the compaction, the RegionServer terminates the compaction and restarts it later, and the newly generated obsolete HFile is deleted. One example of in-storage compaction is shown in Figure 13.
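
The core merge loop at the compaction-ISSN can be sketched as below. We assume, for illustration only, that each scan-ISSN stream yields tuples of (key, -timestamp, value, is_tombstone) so that the newest version of a key is merged first; ListWriter is a toy stand-in for the writer of the pre-created HFile.

```python
import heapq

class ListWriter:
    """Toy stand-in for the new-HFile writer on the compaction-ISSN's node."""
    def __init__(self): self.rows = []
    def append(self, key, value): self.rows.append((key, value))
    def close(self): pass      # then update the metadata server, notify the RegionServer

def in_storage_compact(scan_issn_streams, writer):
    """Merge the ordered streams, drop invalid KV-pairs, append the valid ones."""
    last_key = None
    for key, _neg_ts, value, is_tombstone in heapq.merge(*scan_issn_streams):
        if key == last_key:    # an older (updated) version of an already-seen key
            continue
        last_key = key
        if is_tombstone:       # deleted KV-pair: discard the tombstone as well
            continue
        writer.append(key, value)
    writer.close()

# Example: "a" has two versions (keep the newest), "b" was deleted, "c" is valid.
s1 = [("a", -2, "v2", False), ("a", -1, "v1", False)]
s2 = [("b", -3, "", True), ("c", -1, "v", False)]
w = ListWriter()
in_storage_compact([s1, s2], w)
print(w.rows)                  # [('a', 'v2'), ('c', 'v')]
```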

Fig. 13.

Fig. 13. One example of in-storage compaction is achieved by IS-HBase. RegionServer 2 selects HFile 12, 14, 21, and 53 as candidates to be compacted and writes the valid KV-pairs to HFile 66. The ISSN on storage server 2 is selected to execute the compaction which is called compaction-ISSN.

In general, since all the HFiles are immutable, in-storage compaction can run independently and has little influence on the clients' primary requests. Compared with legacy compaction, in-storage compaction saves both the HFile reads and writes over the network as well as the CPU and memory resources of the RegionServer. The compaction itself is triggered by the HBase compaction policy; we do not discuss controlling the compaction trigger time in this work. Importantly, large-scale compactions require more CPU and memory resources, which will impact the storage server performance, so small-scale compactions are preferred for in-storage compaction. The policy that decides whether in-storage compaction should be used will be our future work. The decision needs to consider the tradeoffs between the network saving benefit and the limited resources at storage nodes.

5.4 Data Correctness and Consistency

In legacy HBase, KV-pairs are sorted and stored in the read-only HFiles. For a read request, the RegionServer checks the KV-pairs in the Memtable and cache first. Then, if a miss happens in both the Memtable and cache, it reads out the data blocks from HFiles to further search for the KV-pairs. If the requested KV-pair does not exist, the RegionServer simply cannot locate any KV-pair with that key. If the KV-pair has been deleted, the RegionServer will find the key marked with a special tombstone, or the KV-pair has already been removed together with the tombstone during a compaction. The read process is executed by the RegionServer. By creating a version view (similar to a snapshot of the database when the query is called) for each read query, HBase ensures data consistency. By querying the heap in a RegionServer, the RegionServer can always get the correct KV-pairs.

In contrast, in IS-HBase, only the Level-2 heap is maintained in the RegionServer; the Level-1 heaps are created by different ISSNs according to the locations of HFiles. Therefore, several special mechanisms are needed to ensure data correctness and consistency. First, to ensure the correctness of the KV-pairs being read out, a RegionServer needs to make sure the RPC connections to the relevant ISSNs are in a correct status. Otherwise, the RegionServer returns an exception to the client. Second, it is possible that an HFile is corrupted or an HFile scanner cannot perform correctly. The ISSN then reports the exception to the RegionServer, and the RegionServer decides whether to ignore the exception, restart the read, or terminate the read process. Third, due to a bad network connection, the RPC between the RegionServer and an ISSN may time out. If the RegionServer fails to reconnect to a relevant ISSN after a number of retries due to a failed ISSN, the RegionServer falls back to the legacy mode, that is, directly reading out data blocks from the storage nodes and processing the data blocks in the RegionServer without help from the ISSNs. If the read queries also fail in legacy mode, the RegionServer returns exceptions to the client.

5.5 Cache Design

Since a RegionServer receives only KV-pairs from the relevant ISSNs instead of data blocks containing KV-pairs for Get and some Scan queries in IS-HBase, the RegionServer can cache the KV-pairs instead of data blocks in memory. However, this causes two challenges: (1) How to index and maintain the KV-pairs in the cache such that it can serve future Get and Scan queries? and (2) How to identify invalid KV-pairs during a compaction and evict these KV-pairs from the cache?

In legacy HBase, a RegionServer only caches blocks (both metadata blocks and data blocks). During a KV-pair search, a RegionServer reads the correlated metadata blocks (e.g., indexing blocks) into memory such that the RegionServer can identify the sequence numbers of the data blocks that are needed to respond to the query. By searching for these sequence numbers in the cache, the RegionServer can decide whether the data blocks should be read out from the corresponding HFile. With the read-in data blocks, the scan Iterator can find the start-key in the data block and obtain the next KV-pairs by reading the consecutive KV-pairs.

For Get queries, by comparing the KV-pairs in the Memtable, the KV-pairs in the cache, and the KV-pairs from ISSNs with the same key, the RegionServer can decide which KV-pair is valid. However, for Scan queries, the RegionServer cannot decide whether the KV-pair in the cache is the next KV-pair to be read out without comparing the KV-pairs from all Level-1 heaps in the relevant ISSNs. Therefore, Scan always bypasses the KV-pair based cache.

The KV-pair based cache is implemented as a hash table with an LRU replacement policy. The name of the HFile that holds a KV-pair is used as the prefix of the cache key of that KV-pair. For Get queries, by checking the cache, a RegionServer may skip checking an HFile whose requested KV-pair is already in the cache. Therefore, the RegionServer only sends requests to the relevant ISSNs whose HFiles have a cache miss for certain keys. This reduces the number of reads at the ISSN side and the required network traffic when there are cache hits. When processing a Scan, the RegionServer bypasses the KV-pair cache and either sends requests to the relevant ISSNs for the requested KV-pairs or reads out the data blocks directly as discussed in Section 5.2.1. In this process, only the received KV-pairs are cached. During a compaction, a RegionServer evicts the KV-pairs whose HFiles are selected to be compacted. This ensures that the cached KV-pairs are always valid.
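
A minimal sketch of such a cache is shown below: an LRU hash table whose cache key combines the HFile name and the row key, so that all entries from a compacted HFile can be invalidated by prefix. Sizes and structures are illustrative.

```python
from collections import OrderedDict

class KVPairCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()           # (hfile_name, key) -> value, in LRU order

    def get(self, hfile_name: str, key: str):
        ck = (hfile_name, key)
        if ck in self.entries:
            self.entries.move_to_end(ck)       # refresh the LRU position on a hit
            return self.entries[ck]
        return None

    def put(self, hfile_name: str, key: str, value):
        self.entries[(hfile_name, key)] = value
        self.entries.move_to_end((hfile_name, key))
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry

    def invalidate_hfile(self, hfile_name: str):
        """Called during compaction: evict every KV-pair cached from this HFile."""
        for ck in [ck for ck in self.entries if ck[0] == hfile_name]:
            del self.entries[ck]
```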

Since the KV-pair based cache is bypassed for all Scan queries, the performance of Scan will be lower for some hot key-ranges whose KV-pairs are also requested by other Scan queries. Therefore, there is a need to design a self-adaptive cache, which keeps both KV-pairs and data blocks in memory to improve the performance of IS-HBase. We discuss this self-adaptive cache in detail in Section 6.

6 SELF-ADAPTIVE BLOCK CACHE

One big challenge of in-storage computing-based design is how to handle the caching issue. The place and the logic of processing data are changed in the new architecture. A new caching scheme is needed to handle different data types at the storage nodes as well as at the computing nodes.

As we discussed in Section 5.5, since IS-HBase gets KV-pairs from ISSNs instead of getting data blocks from storage nodes, maintaining a KV-pair based cache in a RegionServer is a straightforward solution to reduce cache misses. This can reduce the latency of Get queries when there is a cache hit. However, the KV-pair cache cannot be used for Scan because a RegionServer cannot decide which KV-pair is the next one after the currently selected KV-pair by only checking the KV-pairs in the cache. Always reading KV-pairs from storage nodes when responding to Scan queries can cause a high performance penalty. If the scan length is long, the overhead of reading a data block can be smaller than reading a set of KV-pairs separately. Also, if the workload has very good temporal and key-space locality (i.e., nearby KV-pairs are accessed within a short period of time), caching data blocks will have fewer cache misses. Therefore, further optimizing the cache for IS-HBase is essential.

To benefit both Get and Scan queries, we propose a self-adaptive block cache for IS-HBase. It caches either KV-pairs or data blocks. To achieve high cache space utilization, the cache space boundary between KV-pairs and data blocks changes dynamically based on the variation of the current workload. We first introduce the architecture of the proposed self-adaptive block cache. Then, the process of the cache boundary adjustment is discussed.

6.1 Architecture of Self-Adaptive Block Cache

Caching has a trade-off between the exploration of key-space locality and the effectiveness of using cache space. Pure block-based caching can maximize potential key-space locality, but it has a high transmission overhead due to the much larger data block size and high cache space usage. Caching only KV-pairs may not take advantage of key-space locality, but it uses less cache space. Considering the pros and cons of these two possible cache designs, we propose a new caching scheme, which selectively caches either blocks or KV-pairs. However, maintaining fixed-size spaces for the KV-pair cache and the block cache cannot adapt to the variation of the current workload. Therefore, the space partition between the KV-pair cache and the block cache needs to be dynamically adjusted. Several challenges should be addressed in the new caching scheme. First, under what conditions do we need to read data blocks instead of KV-pairs from storage nodes to RegionServers? Second, with limited cache space, how do we allocate the cache space between data block caching and KV-pair caching? Third, what eviction policy should be used in the cache?

To address the aforementioned issues, we propose a self-adaptive block cache for RegionServers to preserve block key-space locality while consuming less cache space and network bandwidth with KV-pair caching. The architecture of the self-adaptive block cache is shown in Figure 14. In IS-HBase, a RegionServer maintains a unified cache space that is shared by both KV-pairs and blocks. Two different formats of blocks are cached: one is a typical complete block and the other is a partial block. A complete block is a data block read out from an HFile, while a partial block is a variable-sized block whose size depends on the cached KV-pairs from the same HFile data block. A block sequence number, which uniquely identifies the block in IS-HBase, is used to index both complete blocks and partial blocks. Besides, we also maintain the following information for each block: (1) a flag to indicate whether it is a partial block or a complete block, (2) a cache hit counter and a cache miss counter for each region (a region represents a key-range and is discussed later) in a complete block or a partial block, where the sum of these counters is the access number of the block, and (3) the order of partial blocks and their corresponding sizes. Cache hit and miss counters are used to measure the hotness of a region in a block. The locations of partial blocks in a block are discussed in the next subsection.
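
A possible layout of this per-block bookkeeping is sketched below, mirroring the three pieces of per-block information listed above. The field names are illustrative assumptions, not the actual IS-HBase data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CachedBlock:
    block_seq_no: int                       # uniquely identifies the HFile data block
    is_partial: bool                        # partial block (some KV-pairs) or complete block
    kv_pairs: Dict[str, bytes] = field(default_factory=dict)
    # Per-region hit/miss counters; a region is a sub key-range of the block.
    region_hits: List[int] = field(default_factory=list)
    region_misses: List[int] = field(default_factory=list)

    def access_count(self) -> int:
        # The block's access number is the sum of all region hit/miss counters.
        return sum(self.region_hits) + sum(self.region_misses)

    def cached_size(self) -> int:
        # A partial block's size depends only on the KV-pairs it actually holds.
        return sum(len(k) + len(v) for k, v in self.kv_pairs.items())
```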

Fig. 14.

Fig. 14. The overview of self-adaptive block cache.

Although caching KV-pairs and blocks separately is easier to implement, adjusting the cache space boundary between them is challenging. Specifically, if we cached KV-pairs in an independent space without knowing which blocks the KV-pairs belong to, we would not be able to measure the key-space locality of these blocks. It is therefore necessary to maintain statistics for a partial block even though only some of its KV-pairs are in the cache. For example, if a partial block has a high number of cache misses, we should allocate more space to accommodate more KV-pairs from this block to reduce the cache misses.

6.2 Self-Adaptive Cache Adjustment

We propose a self-upgrade/downgrade adaptive scheme to allocate and reclaim the cache space for partial blocks and complete blocks to achieve a higher cache hit ratio and cache space efficiency. Upgrade is the process of allocating more cache space for a partial block when it is suffering from cache misses. A large number of cache misses on the same partial block indicates that the data block has a strong key-space locality and needs more cache space to accommodate more of its KV-pairs. In certain cases, merely caching individual KV-pairs one by one may still suffer from compulsory misses (i.e., a miss occurs at the first access). Thus, we perform an upgrade by acquiring the complete data block to replace that partial block.

Downgrade is the process of evicting a complete block and caching only the hot KV-pairs in the block. A complete block contains hundreds of KV-pairs, and not all of them need to be kept in the cache. In this case, it is more efficient to cache only the hot KV-pairs rather than the whole data block. However, identifying hot and cold KV-pairs in a complete block is still an issue. We rely on a hotness sorting method (described in the following subsections) to divide a complete block into two regions: a hot region and a cold region. During the downgrade process, cold KV-pairs from the cold region are evicted, and hot KV-pairs are downgraded to a partial block. In other cases, if the whole data block is cold, it can be directly evicted from the cache. With this self-upgrade/downgrade scheme, the partition between the KV-pair cache and the data block cache is dynamically determined. A RegionServer keeps requesting KV-pairs most of the time and only acquires a complete block when there is an upgrade.

Self-Upgrade. When the miss counter of a partial block reaches a threshold, this data block is selected as a candidate to be read out from the storage node and cached in the RegionServer as a complete block. The miss threshold is set to be proportional to the amount of network traffic. When the network traffic is heavy, it is better to transmit KV-pairs rather than the whole block such that the network traffic can be relieved. In this case, the threshold needs to be high enough to suppress most dispensable block transmissions. With the upgrade of a partial block, future cache misses of the KV-pairs in this block can be effectively reduced. However, since a partial block is smaller than a complete block, upgrading a partial block to a complete block requires more cache space. To make room for the upgrade, some of the other partial blocks or one complete block may need to be evicted to reclaim space for the new complete block.

We first check whether there is any complete block that is cold enough to be evicted. A cold complete block is a block with a low recent access number (i.e., hotness, formulated below). If this number is lower than the recent access number of the partial block which is waiting for an upgrade, the complete block will be downgraded or totally evicted. If there is no such cold complete block, a RegionServer follows an LRU policy to choose some of the partial blocks to evict. The new complete block inherits the cache hit and miss counters of the original partial block. To exclude the interference of stale statistics (i.e., miss and hit numbers), once a block is evicted, its access counters are reset to 0.

Self-Downgrade. This process evicts the cold KV-pairs from complete blocks and caches only the hot KV-pairs as partial blocks. As mentioned, a downgrade is triggered when a partial block needs to be upgraded to a new complete block; to reclaim space for the new complete block, an old and cold complete block is downgraded or totally evicted. If a small portion of the KV-pairs in the old block is hot and the other KV-pairs are rarely accessed, this block is downgraded; otherwise, it is evicted. When downgraded, a partial block is created to cache those hot KV-pairs and the complete data block is discarded.

However, monitoring the hotness of individual KV-pairs has a high memory overhead, which cannot be afforded in IS-HBase. According to [14], hot KV-pairs are usually located close to each other in the key-space. Therefore, instead of monitoring the hotness of individual KV-pairs, we divide the key-space of the KV-pairs in a data block into several regions. KV-pairs in cold regions are evicted from the cache.

Monitoring the hotness of KV-pair regions is challenging. In this article, we use a simple method to calculate the hotness of one region, which reflects both recency and frequency. In each period of time, the hotness of one KV-pair region is recalculated. Suppose the hotness of the last monitoring period is \( H^{i-1} \), the total access count of this region during the current monitoring period is \( R^{i} \), and the hotness of the current monitoring period is \( H^{i} \). We have \( H^{i} = \alpha \times R^{i}+(1-\alpha)\times H^{i-1} \), where \( \alpha \) is a weight that reflects the importance of the access count in the current monitoring period. For a workload whose spatial locality changes fast, a region that was hot in the last monitoring period may cool down quickly in the current monitoring period. In this case, \( \alpha \) should be high enough to give more weight to the access count in the current period. Otherwise, for workloads with stable spatial locality, a low \( \alpha \) enables a long-term hotness observation. In our experiments, we adopt the value of \( \alpha \) based on the workload we use.
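
The hotness update above written out in code; the concrete values of \( \alpha \) and the access counts below are purely illustrative.

```python
def update_hotness(prev_hotness: float, access_count: int, alpha: float) -> float:
    """H^i = alpha * R^i + (1 - alpha) * H^{i-1}."""
    return alpha * access_count + (1 - alpha) * prev_hotness

def block_hotness(region_hotness):
    # A complete block's hotness is the sum of its regions' hotness;
    # a partial block is treated as a single region.
    return sum(region_hotness)

# A region that was hot (H = 40) but saw only 5 accesses this period cools down
# quickly with a high alpha, and slowly with a low alpha.
print(update_hotness(40, 5, alpha=0.8))   # 12.0
print(update_hotness(40, 5, alpha=0.2))   # 33.0
```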

Hotness is used to measure not only regions but also complete blocks and partial blocks. For a complete block, its hotness is the sum of the hotness of all its regions, while a partial block is considered as having only one region when calculating its hotness. When choosing a complete block to downgrade, its hotness should be lower than that of the partial block waiting to be upgraded. When a complete block is selected to be downgraded, we first sort its KV-pair regions by hotness in descending order. According to the space that the new complete block requires, we evict regions of the corresponding size from the cold and old block.

As discussed, both self-upgrade and self-downgrade are processes used to balance cached KV-pairs (partial blocks) and complete blocks. However, these approaches are only used when a partial block or complete block needs a change. Therefore, a RegionServer still employs an LRU policy as the regular cache replacement policy. Since acquiring a complete block only occurs in upgrades, all regular insertions and evictions are on partial blocks. An insertion creates a new partial block if no such partial block exists, while an eviction may delete a partial block that is least recently used.

6.3 Get, Scan, and Delete with Self-Adaptive Block Cache

The logic flows of Get and Scan queries are shown in Figure 15. For a Get, the RegionServer first checks the metadata of the HFile to find the corresponding block sequence number and then uses this block sequence number to check whether such a block or partial block is in the cache. If such a block exists, whether it is a complete block or a partial block, we use a binary search to find the requested KV-pair. However, unlike a complete block, a partial block may not contain that KV-pair. In this case, the miss counter of that partial block is increased by 1, and the RegionServer sends a Get request to the relevant ISSNs to access that KV-pair as described in Section 5.2.2. On the other hand, if there is no such block in the cache, the RegionServer also sends a Get request with the block number and the requested key to the relevant ISSNs. Once the requested KV-pair has been received, it is inserted into the cache if the corresponding partial block exists. Otherwise, the RegionServer first creates a new partial block with that block number and then inserts this KV-pair.
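
A condensed sketch of this Get flow is given below. The helpers (block_for_key, cache.lookup, binary_search, create_partial_block, fetch_kv_from_issn) are illustrative names for the steps in Figure 15, not real APIs.

```python
def get_with_adaptive_cache(key, hfile_meta, cache, fetch_kv_from_issn):
    block_no = hfile_meta.block_for_key(key)      # from the HFile index (metadata) blocks
    block = cache.lookup(block_no)                # complete block, partial block, or None
    if block is not None:
        value = block.binary_search(key)
        if value is not None:
            return value                          # cache hit in a complete or partial block
        if block.is_partial:
            block.miss_count += 1                 # counted toward a future upgrade
    # Cache miss (or partial block without the KV-pair): ask the relevant ISSN.
    value = fetch_kv_from_issn(block_no, key)
    if block is None:
        block = cache.create_partial_block(block_no)
    block.insert(key, value)                      # keep the fetched KV-pair in the cache
    return value
```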

Fig. 15.

Fig. 15. The logic flows to handle Get and Scan queries with self-adaptive block cache.

As discussed in Section 5.2.1, a Scan can be considered as a sequence of Next() calls. Therefore, a Scan first performs the same operations as a Get. If there is no complete block containing the start-key in the cache, the RegionServer sends a request with the start-key to the relevant ISSNs. After receiving the corresponding KV-pairs, the RegionServer also inserts those KV-pairs into a newly created partial block. However, if that block exists in the cache only as a partial block, the partial block cannot serve the Next() calls of the Scan, since a partial block contains only some of the KV-pairs and cannot guarantee that the next KV-pair is present as a complete block does. Thus, if only a partial block exists, we do not check that partial block and consider it a cache miss. If the complete block that contains the start-key is in the cache, we can scan from the start-key and keep requesting the next KV-pair one by one from this block. If this block has been searched through, we check the metadata block to get the next block number and repeat the aforementioned operations.

Sometimes, some of the KV-pairs are deleted or updated by the client (via Put or Delete). In both cases, a deletion tombstone or a new KV-pair with the updated value is stored in the Memtable and flushed out to a new HFile later. In the read process, every access to such a key returns the latest result, which is a Not Found (deleted) or a new value (updated). After the old HFiles are compacted during the compaction process, all the cached KV-pairs and blocks from those compacted HFiles will not be accessed anymore. Since these KV-pairs and blocks are no longer accessed, they are gradually evicted from the cache without any consistency or correctness issue.

7 PERFORMANCE EVALUATION

In this section, we present a comprehensive evaluation of IS-HBase. To quickly evaluate the concept of IS-HBase, we implement two different systems. We conduct a basic IS-HBase evaluation in Section 7.1, in which we port one lightweight RegionServer instance to each storage node and use it as an ISSN to fully utilize its scanning capability. This emulates the process of reading KV-pairs and compaction of IS-HBase in a precise way. In these evaluations, we focus on validating the reduction of NET-AM, the improvement of QPS, and the reduction of latency achieved by IS-HBase. In order to evaluate the design effectiveness of the communication protocols of IS-HBase, the ISSN, in-storage compaction, and different cache designs, we develop a simulator called HBase-sim, and the evaluation results of HBase-sim are discussed in Section 7.2. HBase-sim can generate KV workloads with different features and distributions.

7.1 Basic IS-HBase Evaluation

In this evaluation, a lightweight instance of RegionServer is deployed at each storage node and acts like an ISSN performing the scanning function, while another copy of RegionServer running on another server acts as a RegionServer with its scanning function disabled. Note that deploying a lightweight instance of RegionServer at storage nodes is a way of validating the concept of ISSN; a production-level ISSN will be simpler and specifically designed for IS-HBase. The emulated ISSN is responsible for data pre-processing at the storage node, and the RegionServer combines the results and responds to client requests. We evaluate Get and Scan queries with column filter operations. The size of a KV-pair is 1 KB. Note that the legacy HBase evaluated with Uniform and Zipfian KV-pair distribution workloads is called Legacy Uniform and Legacy Zipfian, respectively. The IS-HBase evaluated with Uniform and Zipfian KV-pair distribution workloads is called IS-HBase Uniform and IS-HBase Zipfian, respectively.

First, we evaluate and compare the network traffic of legacy HBase and the proposed IS-HBase during the write process. In the test, we insert 50 GB of KV-pairs (50 million KV-pairs) into the database. We collect the network traffic amplification (i.e., the generated network traffic divided by the total 50 GB data size) of legacy HBase and IS-HBase after finishing the insertions. As shown in Figure 16, the amplification of IS-HBase does not change much when the number of KV-pairs being inserted varies from 10,000 to 50 million. The network amplification of IS-HBase is caused by flushing HFiles and WAL writes. However, in legacy HBase, during insertions, HFiles are continuously read out to a RegionServer to be compacted and written back as new HFiles. This causes a serious network amplification. As more KV-pairs are inserted, the amplification of legacy HBase varies from 2 to 18. IS-HBase executes the compactions between ISSNs and avoids the additional network traffic.

Fig. 16.

Fig. 16. The network traffic amplification comparison during loading 50 million KV-pairs between legacy HBase and IS-HBase.

Second, we evaluate the network traffic during the read process. After 50 million KV-pairs are inserted into HBase, we use YCSB to issue Get queries with both Uniform and Zipfian distributions. The network traffic amplification comparison between legacy HBase and IS-HBase is shown in Figure 17. By pre-processing the data blocks locally in the storage nodes, IS-HBase obtains 95% and 97% network traffic reductions with the Uniform and Zipfian distribution workloads, respectively, compared with legacy HBase. The network traffic reductions are irrelevant to the network condition.

Fig. 17.

Fig. 17. The network traffic amplification comparison of Get queries between legacy HBase and IS-HBase.

Third, we compare the QPS of Get queries when the network bandwidth varies in Figure 18(a). By configuring Open vSwitch, we vary the available network bandwidth between the RegionServer and the storage node from 1 Mbps to 800 Mbps. As shown in Figure 18(a), when the bandwidth drops below 128 Mbps, the QPS of legacy HBase drops quickly and is much lower than that of IS-HBase. We can speculate that the network bandwidth requirement of a RegionServer for this workload is around 128 Mbps. When the network bandwidth is higher than 128 Mbps, the processing capability of the RegionServer constrains the performance of HBase. When the bandwidth is lower than 128 Mbps, the network becomes the performance bottleneck. In this situation, since IS-HBase processes the data blocks in the storage node and only the requested KV-pairs are sent through the network, the QPS of IS-HBase does not show any explicit degradation. Only when the network bandwidth is extremely limited (e.g., lower than 4 Mbps) does the QPS of IS-HBase decrease.

Fig. 18.

Fig. 18. The QPS and latency of Get queries variations when the network bandwidth varies.

The average latency of Get queries is shown in Figure 18(b). As the available network bandwidth decreases from 800 Mbps to 1 Mbps, the average latency of IS-HBase remains at around 24 ms without significant increases. In the legacy HBase cases, however, the average latency starts increasing dramatically after the available bandwidth drops below 128 Mbps. When the available bandwidth drops from 128 Mbps to 1 Mbps, the average latency of legacy HBase increases to about 1,000 to 2,000 ms (an increase of more than 50 times), while the Get latency of IS-HBase increases only about three times. As we can see, processing data blocks in storage nodes can significantly reduce the required network bandwidth and avoid explicitly increased latency under limited network bandwidth. When the network bandwidth is low, IS-HBase can still achieve high performance. When the network condition is good, the performance of legacy HBase is close to that of IS-HBase. The results are similar in both Uniform and Zipfian distribution based workloads. However, the performance of both legacy HBase and IS-HBase is usually lower in the Uniform distribution based workloads due to fewer cache hits.

Finally, to evaluate the performance of Scan queries with different filtering ratios, we first insert a big table into HBase. The total data is 20 GB with 200,000 rows and 100 columns (20 million KV-pairs in total). One Scan query scans from the first KV-pair of a row to its end (100 consecutive KV-pairs in total) and only returns the KV-pairs from some of the columns according to the filter requirement. We request 20 MB of data (20,000 KV-pairs) in each test with the column filter ratio varying from 0% to 95%. If the filter ratio is 95%, only 5% of the KV-pairs being scanned in a Scan query are returned to the client and 4,000 Scan queries are issued. If the filter ratio is 0%, all KV-pairs from one row are returned (100 KV-pairs) for one Scan query and 200 Scan queries are issued. We measure the NET-AM in this experiment.

As shown in Figure 19, the X-axis shows different filter ratios and the Y-axis is the obtained NET-AM. Instead of sending all the data blocks to the RegionServer, the ISSN at the storage node processes the data blocks locally and only returns the requested KV-pairs from certain columns to HBase. Therefore, the NET-AM of IS-HBase is much lower than that of legacy HBase. Moreover, the network amplification factor stays almost stable at around 1.7 when the column filter ratio varies.

Fig. 19.

Fig. 19. The NET-AM variations when different Scan filter ratios are used.

7.2 HBase-Sim Evaluation

We implemented HBase-sim, which simulates the functionality of the storage node, RegionServer, ISSN, and different cache designs. The simulator is deployed on a Dell PowerEdge R430 server with a 2.40 GHz 24-core Intel Xeon CPU and 64 GB of memory. We run each test three times and present the average results.

Real-world workloads can be significantly different from each other. To cover different situations, we summarize several factors that lead to these differences and generate workloads with different configurations of these factors. The IS-HBase design consists of multiple components, and its performance is influenced by factors such as the network condition and KV access distributions. To test the effectiveness of each component, we run a series of ablation studies which evaluate IS-HBase with different components. Finally, we compare our IS-HBase with the self-adaptive block cache against legacy HBase. We use QPS as the measurement metric, which has been discussed in Section 4.

In our simulation, we focus on the interaction between a single RegionServer and a single storage node. The reason we use a single RegionServer is that each RegionServer is responsible for only a certain region (i.e., a specific range of keys). Therefore, the processing of user requests is independent among RegionServers, and there is no need to simulate all the RegionServers in our experiments. Each HFile stored in the storage node is 1 GB in size and consists of 16,384 blocks. The block size is configured as 64 KB.

7.2.1 Workload Generation.

We also developed several workload generators as benchmarks in HBase-sim. For Get requests, keys with different access probabilities are generated in the benchmark. For Scan requests, the start key, scan length, and filter condition are also generated, each following different distributions. In each test, the benchmark issues 1.5 million queries.

We decompose workloads based on different factors. During generation, the following factors are taken into account:

KV-pair size: As the KV-pair size affects the efficiency of sending data from storage node to RegionServer, we tested both small and large KV-pair sizes.

Network condition: Different network bandwidths lead to different latency when transmitting the same amount of data. We run the test with both fixed and dynamic network bandwidths.

KV-pair access distribution: KV-pair access distribution affects both KV-pair hotness and block hotness. To fully evaluate the system performance under different data locality patterns, we test HBase-sim with different KV-pair access distributions. We apply both uniform and Zipfian distributions to different workloads.

Scan length and filter ratio: Both the scan length and the filter ratio have an explicit impact on the effectiveness of applying ISSN (i.e., transmitting KV-pairs vs. data blocks). When the scan length is short (e.g., less than 5) or the filter ratio is very high (e.g., more than 99%), only a small number of KV-pairs in the block are accessed when calling Next() in the iterator. In this case, it is more efficient to read out the requested KV-pairs directly from the ISSN instead of reading data blocks. Moreover, a high filter ratio can lead to “partial hotness” in a block, which means only a small portion of the KV-pairs in the block are accessed frequently.

Get-Scan ratio: User requests can be Get-intensive or Scan-intensive, which affects the KV/block cache partition. In our simulation, different Get and Scan query ratios are evaluated. Based on [8, 14], the ratio between Get and Scan in real-world workloads is about 92:8. In our simulation, we dynamically adjust the Get-Scan ratio during the running time while ensuring the overall ratio remains 92:8.

Read-Write ratio: To comprehensively show how the overall performance changes with both read (i.e., Get and Scan) and write (Put) requests, we also evaluate the QPS with different read-write ratios.

Each of the aforementioned factors has various configurations. We set a default configuration for each factor; if not specified in the following tests, the default configurations are adopted. Specifically, the default KV size is 64 bytes, which is considered to be small. The default network bandwidth pattern is dynamic: since 512 Mbps is already high enough as shown in Figure 10(a), we generate the network bandwidth using a Sine function whose minimum value is 0 bps, maximum value is 512 Mbps, and minimum positive period is 300 requests. The default KV-pair access distribution is Zipfian, as it is much more common than the uniform distribution. The default Scan length is dynamic and its average value is 30. The default scan filter ratio is 0% (no KV-pair is filtered out), and the default Get-Scan ratio is 92:8 [8, 14].
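
A sketch of the dynamic-bandwidth setting described above is shown below: the available bandwidth follows a sine wave between 0 bps and 512 Mbps with a period of 300 requests. The exact generator used in HBase-sim may differ; this is only an illustrative reconstruction.

```python
import math

def available_bandwidth_bps(request_index: int, peak_bps: float = 512e6, period: int = 300) -> float:
    # Shift and scale sin() so the value oscillates between 0 and peak_bps.
    return peak_bps * (math.sin(2 * math.pi * request_index / period) + 1) / 2

print(available_bandwidth_bps(0))     # mid-range: 256 Mbps
print(available_bandwidth_bps(75))    # peak: 512 Mbps
print(available_bandwidth_bps(225))   # trough: 0 bps
```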

To comprehensively study the system performance under different scenarios, we conducted several ablation and independent studies. In the ablation study, we analyze the components in our IS-HBase design with the self-adaptive block cache and test its performance against other baselines. After that, we run several independent studies using both legacy HBase and our IS-HBase with the self-adaptive block cache. In each independent study, we fix the other factors listed above to their default values while testing different settings of the remaining factor.

7.2.2 Ablation Studies.

We use legacy HBase without ISSN, in which only blocks are transmitted, as the baseline, along with several IS-HBase variations. More specifically, we summarize our design as an integration of the following components:

ISSN component: The ISSN in a storage node is the offloaded scanner. It has the ability to scan HFiles and return the requested KV-pairs or blocks. If ISSN is the only component in the system, it will always transmit the requested KV-pairs. With other components (as described below), ISSN may send either KV-pairs or blocks to the RegionServer.

Network-bandwidth-aware component: The system switches between “transmitting KV-pairs” and “transmitting blocks” according to the network condition. Intuitively, when the available network bandwidth is low, it prefers transmitting KV-pairs.

Inter-block-hotness-aware component: The cache replacement policy for complete blocks in RegionServers is priority-based, and the priority is based on the access frequencies of complete and partial blocks.

Intra-block-hotness-aware component (a.k.a. complete-block-downgrade component): A complete block will discard half of its cold KV-pairs and be downgraded to a partial block if only a small portion of the KV-pairs in the block is accessed frequently.

Partial-block-upgrade component: A partial block will be upgraded to a complete block if the miss ratio of this partial block is high. Together with the complete-block-downgrade component above, this implements an adaptive partial/complete block ratio based on the block access locality.

Therefore, in our simulation, we compare the performance of the following five different HBase setups based on the implementation of components in the system.

(1)

HBase-B: This is legacy HBase with a block cache in the RegionServer using the LRU cache replacement policy. For both Get and Scan requests, the RegionServer always reads data blocks into its block cache. This is considered to be the setup of legacy HBase.

(2)

IS-HBase-K: This is IS-HBase with a KV-pair cache in the RegionServer using the LRU cache replacement policy. For both Get and Scan requests, the RegionServer always reads KV-pairs into its KV-pair cache. However, the KV-pair cache is only checked for Get requests; for Scan requests, the RegionServer does not check the KV-pair cache and instead requests the KV-pairs from the relevant HFiles directly. This is the basic version of IS-HBase used to evaluate the efficiency of ISSNs.

(3)

IS-HBase-KB: This is IS-HBase with both a KV-pair cache and a block cache in the RegionServer using the LRU cache replacement policy. The KV-pair and block caches have the same fixed-size memory space. The transmission decision is based on the request type. For Get requests, the KV-pair cache is used and KV-pairs are always transmitted. For Scan requests, the block cache is used and data blocks are always transmitted. This is the baseline of a hybrid cache design.

(4)

IS-HBase-KB-NET: The cache design of IS-HBase-KB-NET is the same as that of IS-HBase-KB. The only difference is that IS-HBase-KB-NET takes the currently available network bandwidth into consideration. For Get requests, it always transmits KV-pairs. For Scan requests, it transmits KV-pairs when the available network bandwidth is lower than \( \theta \) and transmits blocks otherwise.

(5)

IS-HBase-SA: This is the complete version of IS-HBase, which has the self-adaptive block cache in the RegionServer. The partial block cache uses the LRU replacement policy while the complete block cache applies a priority-based cache replacement policy. The partition between the partial and complete block cache is determined by the partial-block-upgrade component and the complete-block-downgrade component.

Evaluation Results. We use the aforementioned default configurations to evaluate the different designs of HBase. As Figure 20 shows, the average QPS of legacy HBase-B is 292.53, which is the lowest. The reason is that it always transmits blocks instead of KV-pairs through the network. When only a small number of KV-pairs are requested, this leads to a high performance overhead, especially when the network bandwidth is low. When the KV-pair cache is enabled, the transmission overhead typically decreases significantly. Our results show that the average QPS of IS-HBase-K is 1,053.03, which is 3.60 times the QPS of legacy HBase-B. As for IS-HBase-KB, although a hybrid cache is adopted, it does not have a smart transmission strategy between the two; the average QPS of this configuration is 1,018.28. When the network condition is bad or the scan length is very short, it is more beneficial to transmit only the requested KV-pairs rather than the whole blocks for Scan requests. IS-HBase-KB-NET is a smarter design: it uses the hybrid cache and takes the available network bandwidth into account to adjust its policy. Its average QPS is 1,280.49. Finally, our IS-HBase-SA design adopts all the former optimizations. Besides, the cache is self-adaptive and automatically decides between partial and full-size blocks during cache insertion. Its average QPS is 1,330.13, which is the highest.

Fig. 20.

Fig. 20. The average QPS of different HBase designs.

To evaluate whether the CPU performance has an explicit influence on the overall performance of IS-HBase, we limited the execution speed of the ISSN on storage nodes. As we can see from Figure 21, changing the execution speed of the storage nodes barely affects the overall QPS. As a matter of fact, the ISSN is not CPU-bound. With the default experiment configuration, in our IS-HBase-SA design, the CPU execution time of the whole system accounts for only 5.64% of the total time. Most of the time is consumed by network transmission and I/Os.

Fig. 21.

Fig. 21. QPS with different storage node execution speed, e.g., 25% means the execution speed on the storage node is 25% of the computing node.

7.2.3 Independent Studies.

We run six groups of experiments to compare the influence of each factor on HBase-B (legacy HBase) and IS-HBase-SA (our new design): (1) KV-pair size: the key size in a KV-pair is configured as 8 bytes while the value size is set to 56 bytes or 1,016 bytes (i.e., the KV-pair size is either 64 bytes or 1,024 bytes, respectively); (2) Network condition: the network bandwidth is set to 512 Mbps. We test different ratios of available network bandwidth for HBase, and the resulting available network bandwidths are 1 Mbps, 2 Mbps, 4 Mbps, 8 Mbps, 16 Mbps, 32 Mbps, 64 Mbps, 128 Mbps, 256 Mbps, and 512 Mbps; (3) KV-pair access distribution: for Get queries, the keys are generated using either a uniform or Zipfian distribution. For Scan queries, the start key is generated by the same distribution as Get queries, while the average scan length is set to 30; (4) Scan filter ratio: the filter ratio is set to 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, or 90%; (5) Get-Scan ratio: we change the Get-Scan request number from 0:8 to 184:8 during the running time, while maintaining the overall average ratio of 92:8 (after the whole test is finished); (6) Read-Write ratio: we test the performance with read-write-mixed workloads, where the read ratio is set to 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, or 90%.

Evaluation Results. Figure 22 shows the average QPS of legacy HBase-B and our new IS-HBase-SA design under different network conditions. Generally, the throughput of both designs increases as the network bandwidth increases. The average QPS of HBase-B is roughly linearly proportional to the available network bandwidth, as it always transmits blocks. On the other hand, in the IS-HBase-SA design, the system reads KV-pairs into memory as partial blocks first and upgrades the partial blocks to full-size ones later. The QPS of IS-HBase-SA is higher than that of HBase-B when the available network bandwidth is below about 256 Mbps, as shown in Figure 22. If the network condition is very good, the latency of transmitting a block is not high, while sending multiple KV-pairs requires several RPCs. Therefore, it is likely that transmitting the whole block is faster in this case. This is the reason the legacy HBase design performs well with very high available network bandwidth.

Fig. 22.

Fig. 22. QPS independent study on network condition.

To further validate how the new cache design in IS-HBase-SA works under different network conditions, we collect the number of cache upgrades per 5,000 queries in Figure 23 and cache downgrades per 5,000 queries in Figure 24 under four different network bandwidths. As we can see from Figure 23, when the available network bandwidth is higher, there are more cache upgrades. Based on our cache design, higher network bandwidth leads to less overhead when transmitting blocks, so more partial blocks are upgraded to complete blocks; there are more cases where transmitting blocks is more efficient than sending KV-pairs. As we mentioned in Section 6.2, a downgrade happens only when a partial block needs to be upgraded to a new complete block, so there are also more downgrades when the available network bandwidth is higher.

Fig. 23.

Fig. 23. Upgrade frequency of IS-HBase-SA per 5000 queries with different available network bandwidth.

Fig. 24.

Fig. 24. Downgrade frequency of IS-HBase-SA per 5000 queries with different available network bandwidth.

The average QPS of legacy HBase-B and our IS-HBase-SA under different Get-Scan ratios is shown in Figure 25. The X-axis in the figure represents the ratio of Get requests in a Get-Scan-mixed workload. Both designs have a higher QPS when the ratio of Scan requests is higher, and IS-HBase-SA performs much better than HBase-B. Each Next() call of a Scan request is considered to be a new query. When there are more Scan requests, both designs can leverage more locality as they both can transmit blocks, and thus have better performance. However, HBase-B always reads blocks instead of KV-pairs. It wastes cache space on data that has no locality, so (1) its performance is much worse than that of IS-HBase-SA; and (2) its performance is not stable when facing workloads with different Get-Scan ratios.

Fig. 25.

Fig. 25. QPS independent study on Get-Scan ratio. The x-axis represents the ratio of Get queries in a Get- Scan-mixed workload.

Also, different Get-Scan ratios influence the numbers of cache upgrades and downgrades. As we mentioned in Section 6.2, Scan benefits more from cache upgrades. To further validate how the cached blocks are upgraded or downgraded with different Get-Scan ratios, we collect the number of upgrades per 5,000 queries in Figure 26 and the number of downgrades per 5,000 queries in Figure 27 with three different Get ratios. As we can see from Figures 26 and 27, when the Get ratio is lower (the Scan ratio is higher), the numbers of both upgrades and downgrades are higher. This verifies that more Scan requests lead to more upgrades because Scan can benefit from the consecutive KV-pairs in a complete block. Note that when the Get ratio is high (e.g., higher than 90%), the numbers of upgrades and downgrades drop dramatically. In our benchmark, the start keys of both Get and Scan follow the same Zipfian distribution pattern. They should cause nearly the same number of cache misses, which is an important criterion to upgrade the partial blocks; only the Next() calls of Scans trigger extra cache misses. Therefore, as shown in Figure 26, when the Scan ratio is low enough (i.e., 95% Get and 5% Scan), the number of partial block upgrades drops dramatically.

Fig. 26.

Fig. 26. Upgrade frequency of IS-HBase-SA per 5,000 queries with different get ratio in a Get-Scan mixed workload.

Fig. 27.

Fig. 27. Downgrade frequency of IS-HBase-SA per 5000 queries with different Get query ratio in a Get-Scan mixed workload.

Figure 28 shows the average QPS of legacy HBase-B and our IS-HBase-SA design under different KV-pair access patterns. As we can see, both designs prefer the Zipfian distribution since there is more data locality.

Fig. 28.

Fig. 28. QPS independent study on key access distribution.

Figure 29 shows the average QPS of legacy HBase-B and our IS-HBase-SA design under different filter ratios. IS-HBase-SA has a much higher QPS than legacy HBase-B because it tends to transmit only the required KV-pairs. Legacy HBase-B does not consider the filter ratio and always takes the whole block as the read unit; thus, the filter ratio has no influence on its performance. On the other hand, a higher filter ratio means fewer KV-pairs are needed from a block, so the read amplification increases. IS-HBase-SA has a small number of false upgrades when facing this read amplification. Therefore, as the filter ratio increases, the performance of IS-HBase-SA goes down slightly.

Fig. 29.

Fig. 29. QPS independent study on filter ratio.

The QPS of both HBase-B and IS-HBase-SA with the read-write-mixed workloads is shown in Figure 30. In this experiment, we mix read and write requests with different ratios. The X-axis represents the ratio of read requests in a read-write-mixed workload. For read requests, we use the default configuration of the Get-Scan ratio, which is 92:8; we use Put operations as write requests. The QPS of IS-HBase-SA is much higher than that of HBase-B for every read-ratio configuration from 10% to 90%. Moreover, one interesting observation from Figure 30 is that when the read ratio is higher, our IS-HBase-SA achieves better performance, while the performance of legacy HBase-B becomes worse. Note that IS-HBase-SA and HBase-B have the same logic to handle write requests. Unlike read requests, where we may transmit and cache data that will never be used in the future, there is no amplification for write requests in either IS-HBase-SA or HBase-B during flush. When the number of write requests reaches a threshold, both IS-HBase-SA and HBase-B flush the Memtable to the storage nodes. Nevertheless, for read requests, HBase-B performs badly since it always transmits whole blocks, while IS-HBase-SA adaptively chooses when to transmit blocks or KV-pairs. Therefore, the overall average QPS (combining both reads and writes) of IS-HBase-SA is better than that of HBase-B. As a result, with a higher read ratio, the overall QPS under the read-write-mixed workload is higher in IS-HBase-SA, while it is lower in HBase-B.

Fig. 30. QPS independent study on read-write ratio. The x-axis represents the ratio of read requests in a read-write-mixed workload.

The independent studies show the influence of different workload patterns, and our IS-HBase-SA design performs well in most cases. Also, as the system scale increases, the network traffic becomes heavier and more complex. Moving data-intensive processing functions close to the data not only mitigates the performance penalty of HBase, but also preserves the performance of other services since less network traffic is generated.

7.2.4 In-storage Compaction vs. Legacy Compaction.

In this section, we conduct several experiments with various configurations to demonstrate the efficiency of in-storage compaction (IS-Compaction). More specifically, we vary the number of storage nodes, the number of HFiles on each storage node, and the available network bandwidth at both the RegionServers and the storage nodes to show the performance improvement of the IS-Compaction design in different scenarios compared with compaction in legacy HBase (Legacy Compaction). To compare IS-Compaction and Legacy Compaction, we measure the execution time to compact the same set of HFiles under the same configurations. Each figure reports normalized time: a data point is the execution time divided by the maximum execution time in that figure. In all experiments, we assume RegionServers are also responsible for other services, so a portion of their network bandwidth is already occupied by the primary services of HBase.

Storage nodes and HFiles. IS-Compaction avoids heavy data traffic between storage nodes and RegionServers. We compare the performance of IS-Compaction and Legacy Compaction in Figure 31. The suffix number in the legend indicates the number of HFiles on each storage node. For example, IS-2 is an IS-Compaction in which each storage node has 2 HFiles to be compacted. The X-axis represents the number of storage nodes involved in the compaction, and the Y-axis shows the execution time divided by the maximum value. As we can see, with the same configurations, IS-Compaction takes much less time to complete than Legacy Compaction. For example, when 4 storage nodes are involved and each has 4 HFiles to be compacted (i.e., comparing IS-4 with Legacy-4 at X = 4, where 16 HFiles are compacted in total), the execution time of IS-Compaction is only 44.63% of that of Legacy Compaction. IS-4 even outperforms Legacy-2, although IS-Compaction has two more HFiles to compact on each storage node than Legacy Compaction.

Fig. 31. Normalized execution time of both in-storage and legacy compaction. The suffix number in the legend is the number of HFiles on each storage node. The X-axis represents the number of storage nodes, and the Y-axis shows the execution time divided by the maximum value.

If we consider the influence of the number of HFiles on each storage node (e.g., IS-1, IS-2, and IS-4), we can see that more HFiles on each storage node leads to a longer execution time, because it takes more time to build the Level-1 heaps in the ISSN at each storage node. Similarly, increasing the number of storage nodes also lengthens the compaction, because more KV-pairs are involved in building the Level-2 heap on the storage node that stores the newly created HFile. Moreover, when we increase the number of storage nodes or HFiles by the same amount, the performance of IS-Compaction degrades much less than that of Legacy Compaction (e.g., IS-4 vs. Legacy-4).

Network Bandwidth. We evaluate the performance of IS-Compaction and Legacy Compaction under different available network bandwidths. We consider the case where there are three storage nodes in total and each storage node has three HFiles to be compacted. In Figure 32, we set the available network bandwidth ratio of the RegionServer to different levels. IS-Compaction performs much better than Legacy Compaction: its execution time is only about 6.31% to 41.84% of that of Legacy Compaction in this figure. IS-Compaction avoids heavy data traffic between the RegionServer and storage nodes; as mentioned in Section 5.3, the data transmitted between them consists of control data and metadata rather than KV-pairs, and is relatively small (e.g., 320 bytes). For Legacy Compaction, all the KV-pairs in these HFiles are transmitted from the storage nodes to one RegionServer during compaction. Typically, the data size transmitted between the RegionServer and storage nodes in IS-Compaction is less than 0.001% of the total size of the HFiles being compacted, which is the amount of data transmitted in legacy HBase. This is also why the execution time of IS-Compaction stays unchanged when the available network bandwidth between the RegionServer and storage nodes is reduced from 100% to 10% in Figure 32.
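The two-level merge described above can be sketched as follows. Assuming each HFile (or per-node stream) is represented as an ordered list of keys, a Level-1 heap performs a k-way merge of the HFiles local to a storage node, and a Level-2 heap on the node holding the new HFile merges the resulting per-node streams; the class and method names are illustrative, not the actual ISSN code:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// A minimal sketch of the two-level merge used by in-storage compaction.
public class TwoLevelMergeSketch {
    // Level-1: k-way merge of the locally stored, individually sorted HFiles.
    static Iterator<String> level1Merge(List<List<String>> localHFiles) {
        return kWayMerge(localHFiles);
    }

    // Level-2: merge the already-merged streams coming from each storage node.
    static Iterator<String> level2Merge(List<List<String>> perNodeStreams) {
        return kWayMerge(perNodeStreams);
    }

    private static Iterator<String> kWayMerge(List<List<String>> sortedRuns) {
        // Heap entries are {run index, position within run}, ordered by key.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> sortedRuns.get(a[0]).get(a[1]).compareTo(sortedRuns.get(b[0]).get(b[1])));
        for (int run = 0; run < sortedRuns.size(); run++) {
            if (!sortedRuns.get(run).isEmpty()) heap.add(new int[]{run, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();                    // smallest key across all runs
            merged.add(sortedRuns.get(top[0]).get(top[1]));
            if (top[1] + 1 < sortedRuns.get(top[0]).size()) {
                heap.add(new int[]{top[0], top[1] + 1});
            }
        }
        return merged.iterator();
    }
}

In this organization, only the Level-2 inputs need to cross the network between storage nodes, which is consistent with the observation that more HFiles per node lengthen the Level-1 phase while more storage nodes lengthen the Level-2 phase.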

Fig. 32. Normalized execution time of both in-storage and legacy compaction. The X-axis represents the available network bandwidth ratio of RegionServers, while the Y-axis shows the execution time divided by the maximum value.

Similarly, if we change the available network bandwidth of the storage nodes, as shown in Figure 33, the performance of Legacy Compaction stays unchanged. This is mainly because the RegionServer's already partially occupied bandwidth remains the bottleneck for Legacy Compaction, whereas IS-Compaction ships the per-node merge results between storage nodes and is therefore directly affected by storage-node bandwidth. Note that in Figure 33, Legacy Compaction performs better than IS-Compaction only when the available network bandwidth of the storage nodes is extremely small (e.g., 10%).

Fig. 33. Normalized execution time of both in-storage and legacy compaction. The X-axis represents the available network bandwidth ratio of storage nodes, while the Y-axis shows the execution time divided by the maximum value.

In general, IS-Compaction typically takes much less time to complete than Legacy Compaction under various configurations of storage nodes and network conditions, because of the reduced network traffic between storage nodes and RegionServers. Moreover, the performance of IS-Compaction is more stable when the available network bandwidth changes. Finally, since IS-Compaction decreases the network bandwidth consumption between RegionServers and storage nodes, primary queries such as Get and Scan are less affected by compactions than they are under Legacy Compaction.

Skip 8CONCLUSION AND FUTURE WORK Section

8 CONCLUSION AND FUTURE WORK

In this article, we investigate the value, challenges, feasibility, and potential approaches of using in-storage computing to improve the performance of HBase in a compute-storage disaggregated infrastructure, and we demonstrate the effectiveness of deploying ISSNs in storage nodes. We first conduct a set of experiments to investigate the performance of HBase in a compute-storage disaggregated infrastructure and find that when the available network bandwidth is small, the performance of HBase degrades severely, mainly because of a high NET-AM. Based on these observations, we propose a new HBase architecture based on the in-storage computing concept, called IS-HBase. IS-HBase runs an ISSN in each storage node, and an ISSN can process Get and Scan queries inside the storage node so that only useful KV-pairs are sent through the network to RegionServers. The read performance of IS-HBase is further improved by a self-adaptive block cache in RegionServers, which addresses the caching issue of in-storage computing. Moreover, IS-HBase avoids the performance penalties caused by compactions by executing in-storage compaction with the help of ISSNs at storage nodes. IS-HBase demonstrates the potential of adopting in-storage computing to optimize other data-intensive distributed applications. In our future work, we will further investigate the opportunities, challenges, and effectiveness of using in-storage computing for other applications, such as machine learning services, to improve their performance in the compute-storage disaggregated infrastructure.

Skip ACKNOWLEDGMENTS Section

ACKNOWLEDGMENTS

We thank the anonymous TOS reviewers and all the members in the CRIS group for their useful comments and feedback to improve our research and paper. We thank Feng Wang for the comments and suggestions in this research.

