1 Introduction

In the Big Data era [25], NoSQL databases [13] have arisen as a solution for contexts where many clients perform a massive number of requests over previously unseen quantities of data. Examples of these contexts are social network databases like Facebook and Twitter or international online stores such as Amazon. NoSQL is not just a technology, but a global term that comprises different database paradigms, including document, key-value or column family-based stores [6, 17].

A common characteristic of NoSQL databases is that they are mainly used when the support of ACID transactions [14] from traditional Relational DataBase Management Systems (RDBMSs) is not vital and, for instance, some temporary inconsistencies in data are tolerable [7]. Dropping the support of ACID transactions allows NoSQL databases, among other things, to scale well against large volumes of data, and to offer an adequate service for a very high number of end users [15].

Another common and important characteristic is that the design of databases for many NoSQL technologies is highly dependent on how the stored data is accessed [8, 21]. In these databases, the structure of the data can be denormalized in order to offer low latencies and high efficiency for the workload for which they are designed [24]. In contrast, this denormalization is not usually done in RDBMSs, where performance optimizations are obtained by other means, such as indexes or materialized views [1].

Unfortunately, the differences between NoSQL and RDBMSs shown above come with some losses for the NoSQL part, the biggest being the inability to apply the well-known and heavily tested design practices of relational databases to the definition of NoSQL data stores. These practices are based on conceptual models, such as the Entity-Relationship (ER) model [9] or UML relational specifications [20], from which many existing CASE tools can automatically infer the final database implementation [2]. In addition to this loss, the differences among NoSQL technologies mean that the design of a NoSQL database may even vary depending on the paradigm we wish to employ [6]. For instance, the design decisions would not be the same if we were targeting a column family-based or a document-based data store [3].

Numerous works about NoSQL design exist in the literature [8, 11, 19, 21]. Nevertheless, due to the heterogeneity of NoSQL, these works usually focus on a single, specific technology. A high-level, conceptual solution for the design of NoSQL data stores, such as the ones available for relational systems, would help to bring these existing, technology-specific works together into a common framework.

Based on this context, we present Mortadelo, a framework that generates NoSQL designs for the data store of our choice. By providing a technology-agnostic data structure model that also includes details about how data are going to be accessed, our framework is able to automatically generate an implementation adapted to the specificities and benefits of the targeted NoSQL database. Mortadelo defines a transformation process which, through a series of steps, first transforms the provided conceptual model into a logical model dependent on the chosen NoSQL paradigm, and then generates the implementation scripts that instantiate the targeted NoSQL technology from that paradigm.

The main strength of Mortadelo can be found in its model-driven, modular architecture, which can be extended to support any new NoSQL paradigm or technology. This architecture has been developed employing de facto modeling standards such as the Eclipse Modeling Framework [23], with the objective of offering a homogeneous treatment of different NoSQL paradigms built on well-known technologies and foundations. With the development of Mortadelo, we expect to cover the existing gap in NoSQL design practices and to offer methodologies analogous to the ones that can be employed for relational systems.

The validity of Mortadelo has been tested through the implementation of a homonymous tool, which currently supports the generation of specifications for column family-based systems, with concrete transformations for Cassandra [5]. Supporting column family data stores has required the development of a metamodel for the logical design of this kind of database, and also the definition of a set of rules to transform the data structure model to this logical model and to the final implementation in the concrete technology. Additionally, we briefly introduce how we are working on the support of document-based data stores, including an example for MongoDB [10].

The remainder of the paper is structured as follows. In Sect. 2 we detail the different phases of the transformation process followed by Mortadelo to generate NoSQL databases, including the description of the different metamodels involved in the process and the rules employed in the transformation. In Sect. 3, we present the prototype tool which implements our framework. Next, in Sect. 4, related works in NoSQL design are discussed. Finally, we present our conclusions and future work in Sect. 5.

2 Framework Description

We start by giving an overview of the transformation process supported by Mortadelo. Then, successive sections describe Mortadelo’s components in more detail.

2.1 Transformation Process Overview

Figure 1 shows the transformation process supported by Mortadelo. In this process, an input model is transformed in a succession of steps to obtain an implementation for a certain target NoSQL data store technology. The following paragraphs comment on these steps.

Fig. 1. Transformation process of Mortadelo.

As introduced before, Mortadelo follows a model-driven approach. Therefore, the input of the transformation process is a model, which conforms to a metamodel that we have denoted as the Generic Data Model (GDM) (Fig. 1, left). An instance of the GDM represents a conceptual definition of the database provided by the user. The GDM is composed of two blocks: (i) the Structure Model, which contains the information about domain entities and their relationships; and (ii) the Access Queries, which define how data from the structure model are going to be requested. The GDM is intentionally platform-independent, so it can be used seamlessly as input for different NoSQL paradigms. We give more details about the GDM components in Sect. 2.2.

The transformation process starts by validating the provided GDM instance to check that it contains no mistakes (Fig. 1, step 1). For instance, if an entity present in an access query is not defined in the GDM, the validation process would indicate an error.
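The kind of check performed in step 1 can be illustrated with a minimal Python sketch. It is not taken from Mortadelo's code base: the `Query` class, the `validate` function and the entity name `Supplier` are hypothetical and only serve to show how an access query that mentions an undeclared entity could be reported.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """Hypothetical, simplified access query: a main entity plus included ones."""
    name: str
    from_entity: str
    inclusions: list[str] = field(default_factory=list)


def validate(entities: set[str], queries: list[Query]) -> list[str]:
    """Return one error message per query entity missing from the structure model."""
    errors = []
    for query in queries:
        for entity in [query.from_entity, *query.inclusions]:
            if entity not in entities:
                errors.append(f"{query.name}: unknown entity '{entity}'")
    return errors


# Entities declared in the (assumed) structure model.
entities = {"Product", "Category"}
queries = [
    Query("Q_ok", "Product", ["Category"]),
    Query("Q_bad", "Product", ["Supplier"]),  # 'Supplier' is not declared
]
errors = validate(entities, queries)
print(errors)
```

An empty list would mean that the instance can be handed over to the next step of the process.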

In step 2, a model-to-model (M2M) transformation translates the conceptual GDM model into a logical NoSQL specification by the application of a set of transformation rules. Due to the heterogeneity of NoSQL, in Mortadelo a logical metamodel and an associated M2M transformation have to be defined for each NoSQL paradigm. In the figure, two logical metamodels are shown: a column family data model and a document data model. These metamodels are intermediate representations, which contain information specific to the paradigm they represent. For instance, the column family data model allows defining the column families that should be instantiated in the final database. However, these specifications are still abstracted from any implementation details, i.e., the logical model of a paradigm can be employed to represent the different technologies that belong to that paradigm.

Finally, the third step of the transformation process consists of a model-to-text (M2T) transformation. The logical model obtained from the M2M transformation of step 2 is used to automatically generate an implementation script for the targeted technology. Continuing with the column family example, an M2T generation from a logical model could be performed to obtain a physical implementation for Cassandra, a database from this paradigm. An analogous example could be made for a document data model and a MongoDB implementation.

This transformation process has been specifically devised to be easily extensible. For instance, if we wish to support another column family-based database, we would only need to define the M2T transformation from the column family logical data model to generate the implementation script of this new database. In the same way, if we wanted to include a new NoSQL paradigm that differs from the ones supported by Mortadelo, such as key-value stores, we would define a new chain of elements such as the one presented with dots in Fig. 1, starting with a logical model for that new paradigm and an M2M transformation from the GDM. This new logical model could then be employed in M2T transformations to target concrete key-value technologies. We consider that the modularity and extensibility offered by Mortadelo favour cohesion and reuse of existing components, such as logical models and transformation rules.
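The extension points just described can be pictured as two registries, one per transformation kind. The following Python sketch is only an illustration under our own assumptions: the registry names, `register_paradigm`, `register_technology` and `run_pipeline` are hypothetical, the transformations are toy stand-ins for the real rules, and the validation of step 1 is omitted for brevity.

```python
from typing import Callable

# Hypothetical registries (illustrative names, not Mortadelo's API).
m2m_transformations: dict[str, Callable[[dict], dict]] = {}
m2t_generators: dict[str, tuple[str, Callable[[dict], str]]] = {}


def register_paradigm(paradigm: str, transform: Callable[[dict], dict]) -> None:
    """Register the GDM-to-logical-model transformation of a paradigm."""
    m2m_transformations[paradigm] = transform


def register_technology(technology: str, paradigm: str,
                        generate: Callable[[dict], str]) -> None:
    """Register a code generator that consumes the paradigm's logical model."""
    m2t_generators[technology] = (paradigm, generate)


def run_pipeline(gdm: dict, technology: str) -> str:
    """Chain step 2 (M2M) and step 3 (M2T) for the requested technology."""
    paradigm, generate = m2t_generators[technology]
    logical_model = m2m_transformations[paradigm](gdm)
    return generate(logical_model)


def gdm_to_column_families(gdm: dict) -> dict:
    """Toy M2M rule: one column family per access query."""
    return {"column_families": [f"cf_{query}" for query in gdm["queries"]]}


def cassandra_script(logical: dict) -> str:
    return "\n".join(f"-- Cassandra table {cf}" for cf in logical["column_families"])


def scylladb_script(logical: dict) -> str:
    return "\n".join(f"-- ScyllaDB table {cf}" for cf in logical["column_families"])


register_paradigm("column_family", gdm_to_column_families)
register_technology("cassandra", "column_family", cassandra_script)
# A second technology of the same paradigm only needs a new M2T generator.
register_technology("scylladb", "column_family", scylladb_script)

gdm = {"queries": ["q1", "q2"]}
cassandra_output = run_pipeline(gdm, "cassandra")
scylladb_output = run_pipeline(gdm, "scylladb")
print(cassandra_output)
print(scylladb_output)
```

In this sketch, adding a key-value paradigm would amount to one additional `register_paradigm` call plus one `register_technology` call per targeted key-value technology.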

The next sections detail the GDM metamodel and describe concrete examples of the transformation process for column family and document-based stores.

2.2 Generic Data Model (GDM)

As mentioned in the previous section, we use instances of the Generic Data Model (GDM) as input for Mortadelo. Figure 2 shows the GDM metamodel. This metamodel contains both the Structure Model and the Access Queries elements, which are described below.

Fig. 2. Fragment of the Generic Data Model metamodel.

The Structure Model (Fig. 2, left) is defined in a UML-like fashion. This is a well-known notation in both the modeling and database research areas, and it is adequate for the specification of the structure of domain data. Moreover, it is independent of any database technology, which is one of the requirements of the presented process. The data structure is defined by the specification of entities. These entities contain features of two kinds: (i) primitive attributes, which store values of a certain type, and (ii) references to other related entities. The references of an entity can have variable cardinality, e.g., 1, 2, 4 or unlimited.

The Access Queries (Fig. 2, right) represent the requests that are going to be performed over the database. These queries are defined in the GDM over entities from the structure model. Queries are defined through a SQL-like structure, which facilitates their later specification with a textual notation. A Query is executed over a main entity, captured by a From element. Any reference from that entity can be included in the query through an Inclusion element. Inclusions work in the same way as a conventional join of a relational SQL query. In addition, entities referenced by those that have been included previously can also be included, i.e., inclusions can be recursively added as long as there are references available. The set of projection attributes that are retrieved by the query is specified as a list of AttributeSelection elements. This list can contain attributes coming from the From or the Inclusion entities. The condition of a query is captured with a BooleanExpression, which allows declaring any desired restrictions. The notation for boolean expressions is not shown in this article for the sake of brevity, as this syntax is probably known by the reader. Finally, ordering can be specified through a set of AttributeSelections, again coming from the entities selected by the From and Inclusion elements.

Fig. 3. GDM’s Structure Model of the e-commerce platform example.

We now show a concrete instantiation of the GDM metamodel through an example. We have selected a database that stores data from an e-commerce platform. The structure model of this platform is shown in Fig. 3.

Clients of this online shop can make purchases of products. Each Purchase has an associated shipping Address and a Bill, which is optional. A Product can belong to different Categories, and it can be purchased from different Providers. The PurchaseLine entity allows including different products in the same purchase.
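To make the shape of such a structure model more tangible, the entities above can be written down with a few Python dataclasses. This is a sketch under assumptions: the classes `Attribute`, `Reference` and `Entity` mimic the concepts of the Structure Model but are not Mortadelo's Ecore classes, and, since Fig. 3 is not reproduced in the text, the attribute types, the cardinalities and all reference names except `categories` are guesses made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Attribute:
    name: str
    type: str


@dataclass
class Reference:
    name: str
    target: str                  # name of the referenced entity
    lower: int = 0               # minimum cardinality
    upper: Optional[int] = None  # maximum cardinality, None means unlimited


@dataclass
class Entity:
    name: str
    attributes: list[Attribute] = field(default_factory=list)
    references: list[Reference] = field(default_factory=list)


# Assumed instance of the e-commerce example (names and types are illustrative).
model = {entity.name: entity for entity in [
    Entity("Client", [Attribute("name", "Text"), Attribute("nationality", "Text")],
           [Reference("purchases", "Purchase")]),
    Entity("Purchase", [Attribute("purchaseDate", "Date")],
           [Reference("shippingAddress", "Address", 1, 1),
            Reference("bill", "Bill", 0, 1),
            Reference("lines", "PurchaseLine", 1, None)]),
    Entity("PurchaseLine", references=[Reference("product", "Product", 1, 1)]),
    Entity("Product",
           [Attribute("productId", "ID"), Attribute("name", "Text"),
            Attribute("description", "Text"), Attribute("price", "Number")],
           [Reference("categories", "Category"), Reference("providers", "Provider")]),
    Entity("Category", [Attribute("name", "Text")]),
    Entity("Provider"),
    Entity("Address"),
    Entity("Bill"),
]}

# References of Purchase that may be absent: only the bill is optional.
optional_refs = [r.name for r in model["Purchase"].references if r.lower == 0]
# References of Product with unlimited upper cardinality.
unlimited_refs = [r.name for r in model["Product"].references if r.upper is None]
print(optional_refs, unlimited_refs)
```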

Fig. 4. Example of a GDM access query over entities of the structure model.

Continuing with the GDM instance definition, Fig. 4 shows an example of how an access query from our GDM can be textually specified. This query retrieves all products of a given category ordered by their prices. The instantiation of the query in the GDM would be as follows. The From entity would be Product (line 3), and an Inclusion is defined to add the Category entity through the categories reference (line 4). From these entities, the retrieved attributes are the name, description and price of the products, and the category name (line 2). The aliases prod and cat are employed to simplify the attribute selection. A condition is defined in line 5 through an equality that restricts the shown products to those belonging to a specific category, which is indicated by its name. Lastly, in line 6, an order by clause specifies that the products should be ordered by their price.
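The instantiation described above can be mirrored with plain Python objects. The sketch below is an assumption-laden illustration: the dataclasses reuse the metamodel names From, Inclusion, AttributeSelection and Query, but their fields, the query name `productsByCategory`, the tuple used in place of a BooleanExpression and the `aliases` helper are hypothetical. It does not reproduce the textual syntax of Fig. 4.

```python
from dataclasses import dataclass, field


@dataclass
class From:
    entity: str
    alias: str


@dataclass
class Inclusion:
    reference: str      # navigated reference, e.g. "prod.categories"
    target_entity: str  # simplification: stored directly instead of resolved
    alias: str


@dataclass
class AttributeSelection:
    alias: str
    attribute: str


@dataclass
class Query:
    name: str
    from_: From
    inclusions: list[Inclusion] = field(default_factory=list)
    projection: list[AttributeSelection] = field(default_factory=list)
    condition: tuple = ()  # simplified stand-in for a BooleanExpression
    order_by: list[AttributeSelection] = field(default_factory=list)

    def aliases(self) -> dict[str, str]:
        """Map every alias used in the query to the entity it denotes."""
        result = {self.from_.alias: self.from_.entity}
        result.update({inc.alias: inc.target_entity for inc in self.inclusions})
        return result


query = Query(
    name="productsByCategory",  # hypothetical query name
    from_=From("Product", "prod"),
    inclusions=[Inclusion("prod.categories", "Category", "cat")],
    projection=[
        AttributeSelection("prod", "name"),
        AttributeSelection("prod", "description"),
        AttributeSelection("prod", "price"),
        AttributeSelection("cat", "name"),
    ],
    condition=(AttributeSelection("cat", "name"), "=", "?"),
    order_by=[AttributeSelection("prod", "price")],
)

aliases = query.aliases()
print(aliases)
```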

In this section, we have seen how input databases can be specified by the instantiation of the structure data model and the access queries of the Generic Data Model. GDM specifications do not contain NoSQL details, which allows employing them as input for any NoSQL technology. The next section shows the logical model for column family data stores, and how Mortadelo can perform the transformations that generate a physical implementation of a Cassandra database from a GDM instance.

2.3 Transformations for Column Family-Based Stores

Figure 5 shows the logical metamodel for column family-based stores. Any provided GDM instance model can be automatically transformed with Mortadelo to conform to this metamodel through a model-to-model transformation.

Fig. 5. Metamodel for the logical modeling of column family-based databases.

In this kind of NoSQL database, information is stored in structures denoted Column Families (CFs), which are collections of rows that contain Column values. These rows are uniquely identified by a key, which is defined by a selection of columns from the CF. For some CF databases, like Cassandra, the columns that form the key are organized in two subsets: (i) the partition key and (ii) the clustering key. The partition key is used to distribute the data of a CF into different physical nodes or machines. Rows with the same partition key are stored together. The clustering key indicates the physical ordering of the CF rows inside each partition.

In this kind of column family database, because querying rows from different physical locations would be inefficient, only data from one CF partition can be queried at a time, that is, only a concrete value for the partition key can be requested in each query. As a result, the redundancy of having different CFs storing the same data is not only recommended, but a necessary mechanism in order to query these data with different conditions.

Columns of a CF can have an assigned type, which can be simple, a collection of simple elements, or user defined. This last type is a composition of other types that can help to perform data denormalizations, an operation that is common in this kind of data store.

Continuing with the online shop example presented in the previous section, we could define a CF for the storage of products. In Fig. 6, an instantiation of this CF with the logical metamodel notation is shown. The CF is denoted ProductById. It is composed of four simple columns: productId, description, name and price. The key is composed of a single column, the productId, which acts as the partition key. This means that each partition would contain data of a single product, and that each query would have to specify the productId of the concrete product of interest.
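A small Python sketch may help to see how the partition key constrains the queries that a column family can serve. The `Column` and `ColumnFamily` classes below are a simplified stand-in for the logical metamodel of Fig. 5, the column types are assumed, and `can_answer` encodes a deliberately simplified version of the restriction discussed above: every partition key column must be fixed, and any other restricted column must belong to the clustering key.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    type: str


@dataclass
class ColumnFamily:
    name: str
    columns: list[Column]
    partition_key: list[str]
    clustering_key: list[str] = field(default_factory=list)

    def key(self) -> list[str]:
        """Full row key: partition key columns followed by clustering key columns."""
        return self.partition_key + self.clustering_key

    def can_answer(self, restricted: set[str]) -> bool:
        """Simplified check of whether a query restricting these columns fits this CF."""
        fixes_partition = set(self.partition_key) <= restricted
        extras = restricted - set(self.partition_key)
        return fixes_partition and extras <= set(self.clustering_key)


# ProductById as described in the text; the column types are assumptions.
product_by_id = ColumnFamily(
    name="ProductById",
    columns=[Column("productId", "uuid"), Column("description", "text"),
             Column("name", "text"), Column("price", "decimal")],
    partition_key=["productId"],
)

print(product_by_id.can_answer({"productId"}))  # query by id fits this CF
print(product_by_id.can_answer({"name"}))       # query by name needs another CF
```

The second call returning a negative answer is the reason why a query by product name, such as Q2 below, requires its own column family even though it stores largely the same data.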

Fig. 6. Example column family from the logical model that stores products data.

Next, we show how Mortadelo can generate a database for the Cassandra NoSQL technology by way of the column family logical model. The input GDM of this example is composed of the Structure Model shown in Fig. 3, while the following queries make up the GDM’s Access Queries:

 

Q1: Products data, given their productId.
Q2: Products data, together with the data from their associated categories, given the product name.
Q3: Products data, given their categories’ names, and ordered by price.
Q4: Purchases data, with their associated bills, given the purchase year, and ordered by purchaseDate.
Q5: Purchase data, with their purchase lines, the client’s name and the products data, given the nationality of the client, and ordered by purchaseDate.

 

Fig. 7. Logical model of the sample database for column family databases.

Figure 7 shows the logical model generated by our framework when applying an M2M transformation to the provided GDM instance. As instances of logical models can become too verbose if displayed graphically (e.g. all the elements of Fig. 6 represent only a single column family), we show the column families definition in a more compact format, where CFs are specified with the <ColumnFamily> stereotype, and <UserDefinedType> does the same for user defined types. The complete logical model, which follows the format shown in Fig. 6, can be visualized in the GitHub repository of our tool, Mortadelo (footnote 1).

The first query (Q1) only requests data from one entity, so a simple transformation rule is applied to generate the ProductById column family from Fig. 6 described above. For the query Q2, which involves Product and Category entities, the column family ProductByName is created, which contains product and category columns. Given that none of the Category columns belongs to the CF key, a user defined type denoted categoryType is created, which holds data about categories. Then, the ProductByName CF stores the categories of a product as a list of type categoryType.

Although query Q3 involves the same entities as Q2, i.e. Product and Category, in this case the categories’ names are part of the partition key. Moreover, the products’ price belongs to the clustering key, in order to introduce ordering. Because different keys are required, a new CF must be created, and this time no user-defined type can be employed. The generated CF is ProductCategories, which contains as columns the attributes from both entities, as shown in Fig. 7. Also, given that the two columns used in the query, i.e. category name and product price, do not guarantee row uniqueness, an extra field denoted idprodcat has been added at the end of the clustering key.
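The way the key of ProductCategories is assembled can be sketched in a few lines of Python. This is not the rule set implemented in Mortadelo, which is available in the tool's repository, but a simplified reading of the paragraph above: equality attributes go to the partition key, ordering attributes go to the clustering key, and a surrogate column is appended when the key does not guarantee uniqueness. The function `build_column_family` and the column name `categoryName` are hypothetical; `ProductCategories` and `idprodcat` come from the text.

```python
from typing import Optional


def build_column_family(name: str, equality: list[str], ordering: list[str],
                        projected: list[str],
                        surrogate_id: Optional[str] = None) -> dict:
    """Derive a column family description from the parts of an access query."""
    clustering = list(ordering)
    # Keep every involved column once, preserving first-seen order.
    columns = list(dict.fromkeys([*equality, *ordering, *projected]))
    if surrogate_id is not None:
        # The key alone does not identify a row, so close it with an extra id.
        clustering.append(surrogate_id)
        columns.append(surrogate_id)
    return {
        "name": name,
        "columns": columns,
        "partition_key": list(equality),
        "clustering_key": clustering,
    }


# Q3: products given a category name, ordered by price.
product_categories = build_column_family(
    name="ProductCategories",
    equality=["categoryName"],  # assumed column name for the category's name
    ordering=["price"],
    projected=["name", "description", "price"],
    surrogate_id="idprodcat",
)
print(product_categories)
```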

Similar rules are applied to generate, from the rest of the sample queries, the other column families shown. For details about the complete transformation rules, we refer the reader again to our tool’s repository.

We show in Fig. 8 the resulting database implementation for Cassandra, which is obtained by our framework in a code generation step from the logical model. Cassandra offers a SQL-like language for database query and definition, called Cassandra Query Language (CQL). In this language, column families are treated and denoted as tables. The primary key, which includes the columns that uniquely identify the rows, is divided into two sets of columns: the first set corresponds to the partition key and the second one to the clustering key.
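As an illustration of this code generation step, the following Python sketch renders a column family description as a CQL `CREATE TABLE` statement, with the partition key in an inner pair of parentheses followed by the clustering columns. It is not the generator used by Mortadelo, whose templates are written in EGL, and it does not claim to reproduce Fig. 8: the `to_cql` function, the dictionary layout, the column name `categoryName` and the CQL types chosen for each column are assumptions.

```python
def to_cql(cf: dict) -> str:
    """Render a column family description as a CQL CREATE TABLE statement."""
    column_lines = ",\n".join(
        f"  {name} {cql_type}" for name, cql_type in cf["columns"].items()
    )
    partition = ", ".join(cf["partition_key"])
    # Partition key columns are grouped; clustering columns follow in order.
    primary_key = ", ".join([f"({partition})", *cf["clustering_key"]])
    return (
        f"CREATE TABLE {cf['name']} (\n"
        f"{column_lines},\n"
        f"  PRIMARY KEY ({primary_key})\n"
        f");"
    )


# Assumed description of the ProductCategories column family.
product_categories = {
    "name": "ProductCategories",
    "columns": {
        "categoryName": "text",
        "price": "decimal",
        "idprodcat": "uuid",
        "name": "text",
        "description": "text",
    },
    "partition_key": ["categoryName"],
    "clustering_key": ["price", "idprodcat"],
}

cql = to_cql(product_categories)
print(cql)
```

Because the logical model already separates the two key subsets, the generator has no design decision left to take: it only has to serialize them in the order that CQL expects.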

The logical metamodel for column families shown in Fig. 5 is also valid, in its current form, for generating code for other databases, like ScyllaDB, which works similarly to Cassandra. However, this metamodel may contain certain concepts that are specific to the Cassandra technology, e.g., the structure of the CF keys. We plan to abstract these concepts in future iterations, in order to ease the support of other column family data stores.

Fig. 8. Cassandra CQL implementation of the sample database.

2.4 Towards Transformations for Document-Based Stores

In this section, we show the current state of our work for the generation of document-based data stores. These stores are generally schema-less. However, as the purpose of Mortadelo is the provision of NoSQL designs based on the storage and data access requirements of the end users, this framework generates a set of collections, whose objective is to store documents, along with a proposed structure to which these documents should conform in order to better support the end user needs. The set of collections and their suggested structure for the documents is defined in a logical document data model. Figure 9 shows the metamodel for this logical model.

As introduced, a document data model is composed of Collections, which have a name that identifies them. Each collection will be used to store documents. The structure of these documents is captured in a DocumentType element. At the moment, collections in Mortadelo are only used to store one kind of document, i.e., they only have one associated instance of DocumentType. However, if we later find out that, for some use cases, it is beneficial to store several types of documents in the same collection, the model will be updated accordingly. A DocumentType element defines the structure of documents as a collection of Fields. These fields can be Primitive elements, Arrays of elements, or even nested DocumentTypes inside the main one. In addition, as some document databases allow defining indexes over these fields to improve performance, we have included this functionality in the metamodel (Fig. 9).

Fig. 9. Metamodel for the logical modeling of document-based stores.

Fig. 10. Example of a denormalized collection in MongoDB that answers query Q2.

For this kind of database, the Access Queries of the GDM can be used to determine whether the logical model must follow a more normalized design, with each collection representing a different entity of the Structure Model, or a more denormalized one, by embedding some entities into another. Figure 10 shows an example of a document that represents a product in MongoDB. Each product contains an embedded array to store the data of the categories to which it belongs. When following this structure, categories are repeated several times, once for each product belonging to them, which introduces data redundancy in the system. On the other hand, this denormalization could be useful to make the sample query Q2 more efficient, since all the required information is contained in a single collection, instead of having to consult several of them (e.g. consulting the categories referenced by a product). We are working on more mechanisms to adapt the provided GDM and transformations to the specificities of document databases.
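The trade-off between both designs can be reproduced with plain Python dictionaries standing in for documents. All data below are made up for illustration, and the field names `_id`, `categoryIds` and `categories`, as well as the `embed_categories` helper, are assumptions rather than the structure shown in Fig. 10.

```python
# Normalized design: categories live in their own collection (made-up sample data).
categories = {"c1": {"name": "Books"}, "c2": {"name": "Gifts"}}
products = [
    {"_id": "p1", "name": "Atlas", "price": 30.0, "categoryIds": ["c1", "c2"]},
    {"_id": "p2", "name": "Novel", "price": 12.5, "categoryIds": ["c1"]},
]


def embed_categories(product: dict, categories: dict) -> dict:
    """Build the denormalized document: category data are copied into the product."""
    document = {k: v for k, v in product.items() if k != "categoryIds"}
    document["categories"] = [dict(categories[cid]) for cid in product["categoryIds"]]
    return document


embedded = [embed_categories(product, categories) for product in products]

# Q2 (product data plus its categories, given the product name) now needs
# a single lookup over one collection.
atlas = next(doc for doc in embedded if doc["name"] == "Atlas")
category_names = [category["name"] for category in atlas["categories"]]

# Price of the denormalization: 'Books' is stored once per product that uses it.
books_copies = sum(
    category["name"] == "Books" for doc in embedded for category in doc["categories"]
)
print(category_names, books_copies)
```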

3 Implementation

We have implemented a prototype of Mortadelo to assess the transformation process presented in the previous section. This implementation has been made available under a free licence in an external repository (footnote 2). The following paragraphs summarize the main components of this repository.

The metamodels presented in Sect. 2 can be found in the corresponding projects of the repository in Ecore [23] format. Specifically, the GDM, column family, and document metamodels are included. In addition, the projects also contain the model-to-model and model-to-text specifications that make up the transformation process. Conventionally, M2M transformations are specified through model-to-model languages such as ETL or ATL. These languages are useful when each input element of a certain type is transformed into one or more output elements. However, this strict mapping may not be appropriate when generating NoSQL designs. For instance, it could be the case that two queries of the GDM’s Access Queries can be answered through the same column family of a Cassandra data store, instead of generating one column family for each query. Therefore, the data structure and access queries have to be treated all at once in the transformation, instead of on a one-by-one basis. For this reason, we decided to employ an imperative language for the M2M transformation process. We selected Xtend (footnote 3), which is a Java-based language that offers advanced model manipulation capabilities. In the case of M2T transformations, they have been specified with EGL (Epsilon Generation Language) [22].
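The need to look at all queries at once can be illustrated with a short Python sketch. It is written in Python rather than Xtend, and it is not the merging strategy implemented in Mortadelo: it simply assumes that queries with the same equality and ordering attributes can share one column family, whose columns are the union of what those queries retrieve. The query names QA, QB and QC and the `plan_column_families` function are hypothetical.

```python
from collections import defaultdict


def plan_column_families(queries: list[dict]) -> dict:
    """Group queries by key signature so that each group shares one column family."""
    groups = defaultdict(lambda: {"queries": [], "columns": set()})
    for query in queries:
        signature = (tuple(query["equality"]), tuple(query["ordering"]))
        groups[signature]["queries"].append(query["name"])
        groups[signature]["columns"].update(query["projected"])
    return dict(groups)


# Hypothetical workload: QA and QB differ only in the attributes they retrieve.
workload = [
    {"name": "QA", "equality": ["productId"], "ordering": [],
     "projected": ["name", "price"]},
    {"name": "QB", "equality": ["productId"], "ordering": [],
     "projected": ["description"]},
    {"name": "QC", "equality": ["name"], "ordering": [],
     "projected": ["price"]},
]

plan = plan_column_families(workload)
for signature, group in plan.items():
    print(signature, group["queries"], sorted(group["columns"]))
```

A rule that maps each query to its own output element in isolation could not produce the shared column family for QA and QB, which is the limitation of a strict one-to-one mapping mentioned above.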

Fig. 11. Editor of the provided GDM textual DSL.

For the GDM metamodel, a textual Domain-Specific Language (DSL) [18] for the manipulation of GDM instances is also provided. This DSL has been implemented with Xtext [12], which provides a full-featured and easily configurable editor. Figure 11 shows a screenshot, where the online shop case study is manipulated through the DSL editor. The left window shows the syntax of the DSL, which allows defining and validating entities and queries over these entities. On the top right window, the corresponding GDM instance model of the processed “onlineShop.gdm” file is shown. This instance would be the input of Mortadelo’s transformation process. Below, in the Properties view, individual details of concrete elements from the model can be consulted, such as the AttributeSelection object selected in the figure.

Finally, an examples project is included, which contains the specifications and resulting NoSQL schemas for the online shop running example of this paper.

4 Related Work

As we mentioned in the introduction, well-known practices of the design process of relational databases are not suitable for NoSQL systems because of the differences between them and RDBMSs [3, 21].

There are works in the literature that face the challenge of NoSQL database design. However, because of the heterogeneity present in NoSQL technologies, most of these works limit their efforts to a concrete paradigm, such as column families [8, 21], key-value [19] or graph-based [11] stores. For instance, Mior et al. [21] present NoSE (NoSQL Schema Evaluator), an initially generic tool for obtaining NoSQL schemas. However, this work focuses on column family databases and, as the authors state in their conclusions, “NoSE may require significant changes to fully exploit the capabilities of different data models”.

Nonetheless, the lack of generality of these works does not make them unusable for our purposes. As mentioned in Sect. 2.1, one of the steps performed by our framework is the transformation of a generic conceptual model to the logical model of a concrete NoSQL paradigm. So, it is possible to include the process described by individual works for a specific paradigm into Mortadelo, therefore contributing to the homogenisation of these works under the same framework. As an example, for column family databases, we have taken as base transformation rules the ones included in NoSE. Also, we have improved them by overcoming some of their limitations, such as the lack of support for User Defined Types and Collections that are useful for the design of certain column families.

Instead of abstracting the design stage, other approaches bring the generality to the application level by presenting high-level interfaces to access underlying data stores. The authors of [4] present one of these interfaces, denoted as SOS (Save Our Systems), which offers a common data access layer for the interconnection with different NoSQL physical storage systems.

There are two works that require special comments, as their objectives relate to the ones of Mortadelo. In the first one, Herrero et al. [16] present a NoSQL design process for analytical workloads. This process, like the one defined by Mortadelo, is divided into three phases, where a conceptual model is first used as input to obtain a logical model, which later gets instantiated in a physical implementation. One of the main differences with respect to our proposal is that, rather than performing manual steps, we seek to automatically generate the NoSQL schemas from the provided generic data model. However, the authors of the mentioned work take into account important factors for the analytical workloads they support, such as data variability. These factors could be included in the future to improve Mortadelo’s transformation process.

The second of these works, authored by Atzeni et al. [3], presents NoAM (NoSQL Abstract Metamodel), a design metamodel that does not focus on any particular NoSQL technology but on giving support to all of them. An instance of this metamodel represents a technology-agnostic NoSQL schema through high-level concepts, which have been generalized from the characteristics of existent NoSQL paradigms.

When we started working on Mortadelo, we studied the possibility of using NoAM as the intermediate logical model that is employed in the transformation process, prior to the code-generation step into a concrete NoSQL solution. Nevertheless, we detected that more information than the one contained in NoAM models was necessary to perform the final transformations for some of the NoSQL databases. For instance, in the case of column family data stores like Cassandra, an extra differentiation between partition and clustering keys is necessary for the final instantiation. We compared the overhead of using a combination of NoAM plus this extra information against the definition of logical metamodels for each NoSQL paradigm, and decided that the latter option was simpler in our case. This is why we employ a column family metamodel and a document metamodel in Sect. 2, instead of a single intermediate model such as NoAM. The use of a logical model for each paradigm or family of NoSQL data stores allows Mortadelo to remain agnostic of concrete details of technologies such as Cassandra or MongoDB until it is necessary (i.e. when the code generation templates for concrete systems are executed). Moreover, these logical models can be reused between technologies of the same paradigm, such as MongoDB and CouchDB for document stores.

5 Conclusions and Future Work

This paper has presented Mortadelo, a framework for the generation of NoSQL databases. The main contribution of Mortadelo is that, following a model-driven approach, it can be used to automatically obtain the implementation of a targeted NoSQL database, by using as input a technology-agnostic data structure model that also includes the description of how data are usually accessed. An advantage offered by this framework is its modular structure, which eases the inclusion of support for new database paradigms or technologies.

We have shown how Mortadelo can be used to generate databases for column family data stores, with a full example for the Cassandra database. We have detailed all the steps of the proposed framework for this example: (i) the implementation of a conceptual data model to specify the data structure in a technology-agnostic way; (ii) the development of an intermediate logical metamodel that captures details of column family databases; and (iii) the implementation of a set of rules to automatically transform the data structure model to the logical model, and this logical model to the implementation code in Cassandra. Also, we have established the first steps to extend our framework for the support of document-based databases, like MongoDB or CouchDB.

As an additional contribution, we have implemented a homonymous prototype tool of Mortadelo. The development of this tool is active, and the metamodels and transformations explained throughout the paper are available in the tool’s repository.

We are currently working towards offering full support for document-based data stores. As future work, we will study the expansion of the framework to support other kinds of NoSQL paradigms, like key-value stores or graph databases. This will also involve researching how to extend the technology-agnostic data structure model in order to take into account other components in the transformation process. After the functionality of Mortadelo has been tested, it is also important to consider the non-functional requirements that usually affect the design of NoSQL data stores. Issues such as scalability, security, consistency, technology/storage restrictions, or workload frequency will be taken into account for future improvements.