Conversion cost and specification on interfaces of key-value stores

https://doi.org/10.1016/j.csi.2016.02.007

Highlights

  • Design an interface description model to abstract interface characteristics.

  • Quantify and evaluate the difference of two interfaces.

  • Propose a prototype interface as a reference for the interface specification.

  • Provide guidance for the design of key–value stores' interfaces.

Abstract

As the data created and analyzed by industry grows quickly and business requirements become more complex, many companies employ more than one key-value store together to serve different tasks. Because key-value stores currently define their own interfaces, which have different attributes and semantics, interoperability among these stores is weak. To get the best interoperability, we may choose the store whose interfaces are most similar to the others, or we may define an interface specification, as the SQL specification does for relational databases. We propose an interface description model (IDM for short) to abstract the interfaces of different key-value stores, and an algorithm to quantify their differences, which we call the conversion cost. With the help of the model and the algorithm, we can measure and compare the interoperability of any two given stores. After studying the interoperability of many stores, we propose an interface prototype, which has the minimum conversion cost to the interfaces of other stores, as a reference for the interface specification of key-value stores. Experiments show the features of the interfaces and demonstrate that the proposed prototype has the best interoperability with other typical stores.

Introduction

With the rapid development of information technologies, the scale of data created and analyzed by industry keeps growing. Traditional relational databases, which organize data into relational models, hit a performance bottleneck when querying and managing big data in such situations [1]. To overcome the limitations of relational databases, NoSQL databases such as key-value stores have emerged in recent years. Data relationships in a key-value store are simplified, and ACID transactional properties are generally given up [2]. A variety of key-value stores are designed with different architectures to meet different requirements, but all of them follow the key-value data model, which is explained in Section 3. Because of their high performance and flexibility, key-value stores are widely used in industry to support big data management [3].

Different databases, especially databases in different categories, have different architectures. When multiple databases are employed within a system, each database has its own advantages in handling specific types of data and is efficient in specific scenarios. Using and maintaining more than one database within a system helps to reduce the heavy burden caused by diverse use cases: each use case is served by the most appropriate database, so the efficiency and scalability of the system are ensured. For techniques such as data warehousing and data integration, the variety of big data means that heterogeneous databases are often employed for storing data. In addition, a system that can employ multiple databases has better portability. It avoids vendor lock-in because an existing database can be replaced easily, in contrast to a system that is locked into a single database. Last but not least, applications should access multiple databases via a uniform interface, so that the diversity and complexity of the underlying databases are transparent to developers.

For these reasons, the use of multiple key-value stores is becoming more common in industry. For an operational system, performance is the key issue, and diverse data are needed to meet various requirements. Multiple databases support such diverse data well, with high performance. For analytical systems, data analysis is frequently performed on massive and varied data. Data tend to be stored in heterogeneous databases because different equipment or different approaches are used during data collection. Query-driven data integration middleware, which accesses multiple databases through a uniform interface, is frequently adopted in data analysis systems.

In fact, most Internet companies, such as Facebook, have already deployed multiple databases in their development environments. Facebook adopts several key-value stores to meet its various business requirements: HBase for the messages service and monitoring, Haystack for photo storage and Memcached for in-memory data storage. Twitter employs Cassandra for atomic counting and HBase to power its search engine. E-business sites like Amazon adopt key-value stores such as MongoDB and Riak to record users' click streams, and Redis to serve static pages efficiently or to cache product-related data. The Chinese online shopping platform TaoBao also adopts an in-memory database, a transaction-supporting key-value database, a key-value store oriented to massive content and archive-oriented storage to support its business. Interoperability, meaning the ability of key-value stores to share data and work together, is therefore necessary [4]. From another perspective, the compatibility, portability, extensibility and reusability of a system are dominated by its underlying databases. It is beneficial if a system can change its underlying databases easily or support heterogeneous databases. For instance, a uniform interface in the data access layer and a lower cost of mapping database interfaces both contribute to portability.

Existing key-value stores, such as HBase, SimpleDB, CouchDB, Riak and MongoDB, have similar data models but distinct architectures. They provide similar interfaces, but the same concept may correspond to different terms and structures. Most key-value stores now provide interfaces for accessing data directly over HTTP, that is, a RESTful API (Application Program Interface) [5]. However, when invoking interfaces over HTTP, users must create an HTTP request and then parse the data carried in the HTTP response, and each key-value store defines its own data format for these HTTP messages. Because the interfaces differ in format, specialized programs, which are costly and scale poorly, inevitably have to be developed for data exchange and data integration among key-value stores. Interoperation among database-based applications and programming against data access frameworks are hard to achieve as well.
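The need for store-specific parsing can be illustrated with a minimal Python sketch. The two response bodies, the store names `store_a` and `store_b`, and the parser functions are invented for illustration and do not correspond to the format of any real key-value store; the sketch only shows that one logical "get" result needs a separate parser per format.

```python
import json

# Two made-up HTTP response bodies for the same logical lookup of key "user:1".
# Neither layout is taken from a real store; both are assumptions.
RESPONSE_STORE_A = '{"key": "user:1", "value": "alice"}'
RESPONSE_STORE_B = '{"rows": [{"id": "user:1", "cells": [{"data": "alice"}]}]}'


def parse_store_a(body: str) -> str:
    """Extract the value from the flat layout of hypothetical store A."""
    return json.loads(body)["value"]


def parse_store_b(body: str) -> str:
    """Extract the value from the nested layout of hypothetical store B."""
    return json.loads(body)["rows"][0]["cells"][0]["data"]


# One specialized parser per store: this is the adaptation work that grows
# with every additional interface format.
PARSERS = {"store_a": parse_store_a, "store_b": parse_store_b}


def get_value(store: str, body: str) -> str:
    """Return the stored value regardless of which store produced the body."""
    return PARSERS[store](body)
```

Both calls, `get_value("store_a", RESPONSE_STORE_A)` and `get_value("store_b", RESPONSE_STORE_B)`, return the same value, but only because a dedicated parser was written for each format.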

Because the interfaces of key-value stores are all more or less distinct, users have to understand many interface formats, and data access becomes inefficient and tedious. If there were a specification for the interfaces of key-value stores, as SQL is for relational databases, the problems of interoperability and vendor lock-in could be solved. When an existing key-value store is replaced, such a specification helps to choose, among the alternatives, the store with the lowest conversion cost for interoperating with the key-value stores that remain in the system. It can also guide the design of related database interfaces and standards by identifying relatively less costly interface formats. Unfortunately, there is no such specification so far, and few studies examine the conversion costs of key-value store interfaces. In this paper, based on “Information technology, cloud data storage and management, part 5, specification on interfaces of Key-value store” [6], a project issued by the “Information Technology Standardization Administration of China”, we define and evaluate the conversion cost between interfaces of key-value stores and provide a reference for their specification.

In this paper, “database” refers to a key-value store, even though a key-value store is not a proper database; “interface” refers to a single operational interface of a database, such as the insert interface or the delete interface; and “(interface) distance” is the cost of adapting one interface into another. “Database interfaces” refers to all operational interfaces of a database, and “database distance” is the aggregated cost of adapting the interfaces of one database into the corresponding interfaces of another database. Based on this description, several open questions bear on the interface specification of key-value stores. (1) Given the interfaces of two databases, what is the conversion cost between them, and how can it be quantified? (2) Is there an existing key-value store that has the minimum conversion cost to the other databases? (3) Theoretically, can an interface be defined whose conversion cost to the other interfaces is minimal? (4) How should the interface specification be designed and evaluated? To the best of our knowledge, several research works, which are explained in Section 2, focus on similar topics.
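The relationship between interface distance, database distance and question (2) can be sketched in a few lines of Python. The store names `A`, `B`, `C`, every number in `DISTANCES`, and the choice of a plain sum as the aggregation are assumptions made for illustration; they are not values or definitions taken from the paper.

```python
from typing import Dict, Tuple

# Hypothetical interface distances per operation for ordered store pairs
# (source store, target store). All names and numbers are invented.
DISTANCES: Dict[Tuple[str, str], Dict[str, float]] = {
    ("A", "B"): {"insert": 2.0, "get": 1.0, "delete": 1.0},
    ("A", "C"): {"insert": 3.0, "get": 2.0, "delete": 1.0},
    ("B", "A"): {"insert": 2.0, "get": 1.0, "delete": 2.0},
    ("B", "C"): {"insert": 4.0, "get": 3.0, "delete": 2.0},
    ("C", "A"): {"insert": 3.0, "get": 2.0, "delete": 2.0},
    ("C", "B"): {"insert": 4.0, "get": 3.0, "delete": 1.0},
}


def database_distance(per_operation: Dict[str, float]) -> float:
    """Aggregate interface distances into one database distance.

    A plain sum is assumed here; the paper may aggregate differently.
    """
    return sum(per_operation.values())


def total_conversion_cost(store: str, distances=DISTANCES) -> float:
    """Sum the database distances from `store` to every other store."""
    return sum(
        database_distance(ops)
        for (source, _target), ops in distances.items()
        if source == store
    )


def cheapest_store(distances=DISTANCES) -> str:
    """Return the store with the minimum total conversion cost to the others."""
    stores = sorted({source for source, _target in distances})
    return min(stores, key=lambda s: total_conversion_cost(s, distances))
```

With these invented numbers, `cheapest_store()` selects store `A`, whose total cost of 10.0 is lower than the 14.0 and 15.0 of the other two stores. Ordered pairs are used because nothing in the text above states that the distance is symmetric.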

In this paper we present an interface description model to describe the interfaces of key-value stores; the model highlights the crucial structure that distinguishes an interface from the others. Each interface is abstracted into a tree structure through this model, and we propose a quantification algorithm to evaluate the difference between interfaces, which we call the “distance”. Essentially, given two interfaces for the same operation in two key-value stores, the distance is the cost of adapting one interface into the other. The distance can be calculated with reference to the edit graph algorithm, and the interfaces that have the minimum distances to the others then indicate a theoretical specification. Experimental results show that the proposed specification has a smaller distance than the interfaces of existing key-value stores. Our contributions in this paper are as follows. (1) We propose an interface description model to abstract interface characteristics. (2) We propose a general way to quantify and evaluate the difference between two RESTful interfaces defined by key-value stores and represent the difference as a distance. (3) Based on the conversion costs of the interfaces of existing key-value stores, we discuss the use of the proposed approach and propose a reference interface specification.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 defines the interface description model, and Section 4 explains the distance quantification algorithms. Section 5 discusses several potential application scenarios for the conversion cost, and Section 6 introduces several proposed database interfaces as a prototype of the national specification on interfaces of key-value stores. In Section 7, we evaluate the distances among HBase, SimpleDB, CouchDB, Riak, MongoDB and our Prototype. The experimental results show that, overall, the Prototype has lower conversion costs than the others. Finally, conclusions and future work are summarized in Section 8.

Section snippets

Related work

To guarantee the interoperability of traditional relational databases, some researchers suggest integrating multiple autonomous database systems as a federated database [7], [8], [9]. A federated database mainly maintains mappings between the data structures or schemas of any two different databases. Although this solution solves the data integration problem among different databases, it tends to build a management system on top of existing databases. It is hard to maintain the system and fails to

Model

In this section, an Interface Description Model (IDM for short) is proposed for modeling the typical structures of interfaces provided by key-value stores. IDM is a tree structure that describes the relationships among components. The distance, which is the cost of adapting one interface to another, is defined as the time complexity of transforming one tree into another. This section defines the IDM; the algorithm for calculating the distance is explained in the next section.
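To make the tree view concrete, the following minimal Python sketch holds one interface as a labeled tree. The `Node` class and the component labels (`request`, `method:PUT`, `path`, `body` and so on) are illustrative assumptions; the actual IDM components are defined in the full paper and are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One component of an interface tree (hypothetical structure)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def preorder(node: Node) -> List[str]:
    """Return the labels of the tree in pre-order (parent before children)."""
    labels = [node.label]
    for child in node.children:
        labels.extend(preorder(child))
    return labels


def size(node: Node) -> int:
    """Count the nodes in the tree."""
    return 1 + sum(size(child) for child in node.children)


# A made-up "put" interface; the labels are assumptions, not IDM terms.
put_interface = Node("request", [
    Node("method:PUT"),
    Node("path", [Node("table"), Node("key")]),
    Node("body", [Node("value")]),
])
```

Representing an interface this way makes structural differences between two stores explicit: an extra path segment or a differently nested body appears as an extra or relocated node, which is the kind of difference a tree-based distance can count.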

In

Algorithms

As defined above, the process of adapting operation tree T1 to T2 is represented as an adaptation queue, and the distance between T1 and T2, written | T1, T2 |, is calculated by aggregating the cost of all adaptations in the queue. In this section we study the algorithm for calculating the distance between two interfaces, called the conversion cost quantification algorithm. Edit distance is a way of quantifying how dissimilar two sequences are by counting the minimum number of operations required to transform
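As background for the edit distance mentioned above, here is a minimal Python sketch of the textbook sequence edit distance (Levenshtein distance) computed by dynamic programming. It is not the conversion cost quantification algorithm of this paper, which operates on interface trees; the unit cost per operation and the example label sequences are assumptions chosen for illustration.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # Row of the table for the empty prefix of a: j insertions build b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        # Turning a[:i] into the empty sequence takes i deletions.
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Made-up component sequences for the same operation in two hypothetical stores.
store_x = ["method", "path", "key", "body"]
store_y = ["method", "path", "table", "key"]
```

For the two example sequences, `edit_distance(store_x, store_y)` is 2: insert `table` before `key` and delete `body`. The function works on any sequences of comparable items, so it also applies to plain strings.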

Application

In this paper, we proposed an approach for “measuring the conversion cost of two database interfaces”. Interfaces of key-value stores are distinctive, so developers have to understand many interface formats to access key-value stores, which is inefficient and tedious. Essentially, the conversion cost measures the difficulty of adapting interfaces across multiple databases, and there are many potential scenarios in which the conversion cost helps to make a decision.

The proposed approach,

Specification

In this paper, the proposed interface description model and algorithms have been applied successfully to the “specification on interfaces of key-value stores” in China by estimating the interoperability of each interface defined in the specification. Because database vendors tend to provide their own particular interfaces rather than unified interfaces, data migration among different databases is laborious. This specification is aimed at providing a set of reference interfaces for the

Experiments

In this section, we design several experiments to evaluate the IDM and the algorithms. First, we verify the validity of our conversion cost quantification algorithm by comparing it with the true cost and with the costs calculated by two similar algorithms. Then, we evaluate the distances among the Prototype and five typical, widely used key-value stores. All quantification algorithms are implemented in Java 6 and the whole process of adaptation is executed on an Intel i5-2300 2.80 GHz Windows 7 Professional

Conclusions and future works

In this paper, we study the conversion cost and specification on interfaces of key-value stores. Because key-value stores tend to define their own interfaces and the same semantics may have different representations, interoperability between them is poor unless an interface specification is defined. The conversion cost between two interfaces is not the only consideration, but it is an important one, when an interface specification is defined; otherwise the existing databases are too hard to follow the

Acknowledgements

Supported by the National Natural Science Foundation of China under Grant Nos. 61433008 and 61502090; the Natural Science Foundation of Liaoning Province under Grant No. 201403314; and the China Postdoctoral Science Foundation under Grant No. 2013M540232.

References (27)

  • László Dobos et al. A platform for federated scientific databases and services.

  • Xinhua Xu. A study on query optimization for federated database systems. Comput. Inf. Sci. (2009)

  • Olivier Curé et al. Data Integration over NoSQL Stores Using Access Path Based Mappings.
