Mortadelo: Automatic generation of NoSQL stores from platform-independent data models

doi:10.1016/j.future.2019.11.032

Future Generation Computer Systems

Volume 105, April 2020, Pages 455-474

https://doi.org/10.1016/j.future.2019.11.032

Highlights

• A NoSQL design process based on a paradigm-independent conceptual data model.
• Physical database models are generated through model transformation techniques.
• Physical database models can be obtained for column family or document databases.
• The generation process can be configured with technology-independent annotations.
• Design quality is not compromised to achieve genericity.

Abstract

In the last decade, several NoSQL systems have emerged as a response to the scalability problems exhibited by classical relational databases when used in Big Data contexts. These NoSQL systems appeared first as physical-level solutions, initially lacking any design methodologies. After this initial batch of systems, several design methodologies for NoSQL have recently been proposed. Nevertheless, most of these methodologies target just one NoSQL paradigm. In addition, as each methodology uses a different conceptual modeling approach, NoSQL database designers would need to remake conceptual models as they switch from one NoSQL paradigm to another. Moreover, most of these design processes provide just a set of design heuristics and guidelines that database designers need to apply manually, which can be a time-consuming and error-prone process. To overcome these limitations, this article presents Mortadelo, a model-driven NoSQL database design process where, from a high-level conceptual model that is independent of any specific NoSQL paradigm, an implementation for a concrete NoSQL database system can be automatically generated. Moreover, this database generation process can be customized, so that design trade-offs can be managed differently according to the needs of each context. We evaluated Mortadelo’s capabilities by generating database implementations for several typical NoSQL case studies. In these cases, Mortadelo was able to generate implementations for the Cassandra and MongoDB NoSQL systems from the same conceptual data model. These implementations were similar to the ones generated by design methodologies specifically developed for a single paradigm. Therefore, design quality is not sacrificed by our approach in favor of generality.

Introduction

Nowadays, several kinds of software systems have pushed relational databases to their limits. Examples of these new kinds of applications are internet-scale applications, such as Twitter or Amazon; Internet of Things (IoT) applications, such as Smart Cities [1], [2]; Industry 4.0 systems [3], [4]; or Big Data systems [5], [6]. All these systems need to manage large volumes of data that are often distributed across several servers, while ensuring low response times and high availability in the context of a high number of concurrent requests. In these scenarios, relational databases have exhibited different scalability problems [7].

In response to these limitations, a new generation of database management systems, denoted as NoSQL systems [8], started to offer some alternatives. Each one of these alternatives was designed for a specific purpose and follows a different approach. Thus, NoSQL is not just a single alternative to relational databases, but a global term that comprises different database strategies, including, among others, document-oriented databases [9], [10], key–value stores [11], [12] or column family stores [13], [14]. Despite their differences, most NoSQL database systems rely on two common features: (1) they make use of data denormalisation to improve response times [15], and (2) they sacrifice some ACID (Atomicity, Consistency, Isolation, and Durability) properties [7], [16] to increase scalability, while providing, on a best-effort basis, other properties that are less restrictive but still useful, such as eventual consistency [17].

These NoSQL technologies emerged first at the implementation level and, consequently, they initially lacked well-defined design processes. Database design methodologies for relational databases [18], which are usually based on conceptual modeling notations such as ER (Entity-Relationship) [19] or UML (Unified Modeling Language) [20], soon proved insufficient for designing NoSQL databases. To take advantage of the benefits provided by data nesting and denormalisation, database designers need to take into account not only which data will be stored in the database, but also how these data will be accessed [21], [22]. In NoSQL systems, working with the same set of data, but with different data access patterns, might lead to different database implementations, because in many NoSQL systems design decisions are driven by how data will be accessed. Traditional database design approaches do not provide adequate support for these issues, mainly because they were created to satisfy other goals, e.g., the aforementioned ACID properties [16].
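The dependence of the physical design on access patterns can be illustrated with a minimal sketch. The data, field names and the `layout_for` helper below are hypothetical and are not taken from the article's case studies; the sketch only shows that the same records end up in two different denormalised layouts when two different queries must be served.

```python
from collections import defaultdict

# Hypothetical normalised records: (customer, product, quantity).
purchases = [
    ("alice", "lamp", 1),
    ("bob", "lamp", 2),
    ("alice", "desk", 1),
]


def layout_for(records, key_index):
    """Group records under the field a query filters on, mimicking a
    denormalised table that is keyed by that access pattern."""
    table = defaultdict(list)
    for record in records:
        table[record[key_index]].append(record)
    return dict(table)


# Access pattern A: "purchases of a given customer".
by_customer = layout_for(purchases, 0)
# Access pattern B: "purchases of a given product".
by_product = layout_for(purchases, 1)

print(by_customer["alice"])
print(by_product["lamp"])
```

Both layouts hold the same information, but each one answers only its own query with a single key lookup, which is why the set of queries has to be known at design time.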

To address this gap, several design methodologies for NoSQL systems have been created in recent years [21], [22], [23], [24]. Nevertheless, these approaches still present some limitations, which can be summarized as follows:

1. Each one of these approaches focuses on a concrete NoSQL paradigm, providing its own conceptual modeling languages and notations. This implies that the same conceptual data model cannot be used to describe the same database in different NoSQL paradigms.
2. Most approaches describe how to design a NoSQL database by means of guidelines or heuristics that must be interpreted and applied manually by database designers. This can be an error-prone and time-consuming process. Only two approaches [21], [24] address design process automation and provide the basis for building CASE tools.
3. Those approaches that tackle automation often use a single strategy to transform the patterns they find at the conceptual level into constructs of the target database. Therefore, they neglect the existence of alternative strategies that might be more adequate in certain contexts, or when targeting different NoSQL paradigms, e.g., document-based or column family stores.
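To make the third limitation more concrete, the following sketch shows two alternative strategies for mapping the same conceptual reference into a document. The `map_reference` function, the strategy names and the sample data are assumptions made for illustration; they are not the transformation rules of any of the cited approaches.

```python
def map_reference(order, customer, strategy):
    """Return a document for `order` using the chosen (hypothetical) strategy."""
    if strategy == "embed":
        # Copy the referenced entity inside the document: faster reads,
        # but the customer data is duplicated in every order.
        return {**order, "customer": dict(customer)}
    if strategy == "reference":
        # Store only the identifier: no duplication, but reading the
        # customer requires a second lookup.
        return {**order, "customer_id": customer["id"]}
    raise ValueError(f"unknown strategy: {strategy}")


customer = {"id": 7, "name": "Alice"}
order = {"id": 100, "total": 25}

embedded = map_reference(order, customer, "embed")
referenced = map_reference(order, customer, "reference")
print(embedded)
print(referenced)
```

A process that always applies only one of these branches cannot adapt to contexts where the other trade-off is preferable.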

To overcome these limitations, we present Mortadelo, a model-driven development process for NoSQL database design. This process builds on previous work and goes one step further by being able to automatically generate implementations for different NoSQL paradigms from the same conceptual model. The generation process is achieved by means of model transformation and code generation techniques. Currently, we have created and implemented model transformation rules for supporting the generation of column family stores and document databases, but the framework could be extended to support other paradigms, such as key–value stores.
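The idea of deriving artifacts for two paradigms from one conceptual description can be sketched as follows. This is a simplified illustration under stated assumptions, not Mortadelo's actual metamodel or transformation rules: the `Entity` and `Query` classes, both generator functions and the use of CQL type names as abstract types are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    attributes: dict  # attribute name -> type name (CQL names, by assumption)


@dataclass
class Query:
    name: str
    entity: Entity
    filter_by: list  # attributes used in equality conditions
    order_by: list   # attributes used for ordering


def to_column_family(query):
    """Emit one CQL table per query: filter attributes become the
    partition key and ordering attributes become clustering columns."""
    cols = ", ".join(f"{a} {t}" for a, t in query.entity.attributes.items())
    partition = ", ".join(query.filter_by)
    key = ", ".join([f"({partition})"] + query.order_by)
    return f"CREATE TABLE {query.name} ({cols}, PRIMARY KEY ({key}));"


def to_document_collection(query):
    """Describe one collection per entity, plus the attributes that an
    index would need to cover to serve the query."""
    return {
        "collection": query.entity.name.lower(),
        "fields": dict(query.entity.attributes),
        "index": query.filter_by + query.order_by,
    }


product = Entity("Product", {"category": "text", "price": "int", "name": "text"})
q = Query("products_by_category", product, ["category"], ["price"])

cql = to_column_family(q)
doc = to_document_collection(q)
print(cql)
print(doc)
```

The same `Query` object feeds both generators, so switching the target paradigm does not require remodelling the data or the access pattern.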

To evaluate the expressiveness and effectiveness of our approach, we used Mortadelo to model different case studies used as test-beds in the NoSQL literature. We compared the generated NoSQL databases with the databases obtained with state-of-the-art NoSQL design methodologies. The results of this evaluation showed that, using Mortadelo, the same conceptual model can be transformed into either a column family database, implemented in Cassandra [25], or a document database, implemented in MongoDB [9]. Moreover, the resulting databases were very similar to those generated by design methodologies devised specifically for one NoSQL paradigm. In some cases, our approach performed even better, whereas in one case our designs might not be as good as the ones generated by other approaches. In addition, our approach offers several transformation alternatives, so the same conceptual model can be handled differently depending on each concrete context. This feature is scarcely supported by existing NoSQL design methodologies.

The remainder of this article is structured as follows. Section 2 presents the running example used throughout the paper and introduces the NoSQL technologies used. Next, in Section 3, related work is discussed. In Section 4, we detail the different phases of the transformation process followed by Mortadelo to generate NoSQL databases. Section 5 includes the evaluation of Mortadelo. Lastly, we present our conclusions and future work in Section 6.

Section snippets

Background

To make this article self-contained, this section provides some background on the technologies used, i.e., column family and document data stores. Before describing these technologies, we introduce the running example that we use to illustrate the different concepts that appear in this work.

Related work

As NoSQL systems emerged, different approaches addressing the design problems of these systems were created. These approaches are summarized in Table 1. We briefly describe each one of these approaches.

Li [23] presented one of the very first works on NoSQL database design. They proposed a set of high-level heuristics for refactoring relational databases into HBase ones. To produce a NoSQL database using this work, we would need to create a relational database first, and then to transform it to

Solution description

We start by describing the general components of the transformation process defined by Mortadelo. Then, successive sections describe these components with more detail.

Evaluation and discussion

As stated in the introduction, the main goal of Mortadelo is to automatically generate databases for different NoSQL paradigms from the same high-level data model. The previous section has shown how this goal is satisfied for the running example. This section analyzes how Mortadelo works in other case studies, providing evidence of its applicability. To evaluate whether our work can be used in different settings, we analyzed it from the following four perspectives: (EI-1)

Summary and future work

This work has presented Mortadelo, a model-driven design process for the generation of NoSQL databases. The main contribution of Mortadelo, when compared with other state-of-the-art approaches, is that, from the same conceptual data model, Mortadelo is able to generate implementations for different NoSQL technologies, such as column family or document-based ones. To the best of our knowledge, this is the first NoSQL database design methodology with these characteristics. To generate these NoSQL

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to express their gratitude to the software engineers from Soincon (http://soincon.es/) for their collaboration during the evaluation of Mortadelo. This work has been partially funded by the Government of Cantabria (Spain) under the doctoral studentship program from the University of Cantabria, Spain, and by the Spanish Government under grant TIN2017-86520-C3-3 R.

Alfonso de la Vega is a postdoctoral researcher at the Software Engineering and Real-Time Group of the University of Cantabria. He received his Ph.D. degree from the same university in 2019. His current research focuses on incorporating the benefits of model-driven engineering and domain-specific languages into the data manipulation and data mining fields, with a special focus on reducing the complexity in conventional analysis processes.

References (46)

Lu Y. Industry 4.0: A survey on technologies, applications and open research issues. J. Ind. Inf. Integr. (2017)
Santos M.Y. et al. A big data system supporting bosch braga industry 4.0 strategy. Int. J. Inf. Manage. (2017)
Barbierato E. et al. Performance evaluation of nosql big-data applications using multi-formalism models. Future Gener. Comput. Syst. (2014)
Corbellini A. et al. Persisting big-data: The nosql landscape. Inf. Syst. (2017)
Atzeni P. et al. Uniform access to NoSQL systems. Inf. Syst. (2014)
Ang L. et al. Big sensor data systems for smart cities. IEEE Internet Things J. (2017)
Costa C. et al. Reinventing the energy bill in smart cities with nosql technologies.
Hecht R. et al. NoSQL evaluation: A use case oriented survey.
Gessert F. NoSQL database systems: A survey and decision guidance. Comput. Sci. - Res. Dev. (2017)
Chodorow K. MongoDB: The Definitive Guide: Powerful and Scalable Data Storage. (2013)
Anderson J.C. et al. CouchDB: The Definitive Guide: Time To Relax. (2010)
DeCandia G. et al. Dynamo: Amazon’s highly available key–value store.
Carlson J.L. Redis in Action. (2013)
Hewitt E. Cassandra: the Definitive Guide. (2010)
George L. Hbase: the Definitive Guide: Random Access to Your Planet-Size Data. (2011)
Vajk T. et al. Denormalizing data into schema-free databases.
Haerder T. et al. Principles of Transaction-Oriented Database Recovery. (1983)
Chandra D.G. Base analysis of nosql database. Future Gener. Comput. Syst. (2015)
Codd E.F. The Relational Model for Database Management: Version 2. (1990)
Chen P.P.-S. The entity-relationship model–toward a unified view of data. ACM Trans. Database Syst. (1976)
Li L. et al. UML specification and relational database. J. Object Technol. (2003)
Chebotko A. et al. A big data modeling methodology for Apache Cassandra. Int. Congr. Big Data (2015)
Mior M.J. et al. NoSE: Schema design for NoSQL applications. IEEE Trans. Knowl. Data Eng. (2017)


Diego García-Saiz is an Assistant Professor at the University of Cantabria. He obtained his Ph.D. in Computer Science in 2016. His main research lines are in the Data Science and Software Engineering fields, having several publications in high impact journals, conferences and book chapters.

Carlos Blanco has a Ph.D. in Computer Science from the University of Castilla-La Mancha (Spain). He works as a lecturer at the Science Faculty of the University of Cantabria (Spain) and is a member of several research groups: GSyA (University of Castilla-La Mancha) and ISTR (University of Cantabria). His research activity is in the field of Security for Information Systems and is especially focused on securing Big Data, Data Warehouses and OLAP systems by using MDE approaches. He has published several international communications, papers and book chapters related to these topics (DSS, CSI, INFSOF, ComSIS, TCJ, ER, DaWaK, etc.). He is involved in the organization of several international workshops (WOSIS, WISSE, MoBiD) and has served as a reviewer for international journals, conferences and workshops (INFSOF, CSI, DSS, TCJ, ARES, ER, DaWaK, SECRYPT, etc.).

Marta Zorrilla is an Associate Professor in Computer Science within the Software Engineering and Real Time Group at the University of Cantabria (Spain). She has been involved in several national and European research projects together with other international research institutions. Her research interests are database technologies, data mining and big data, currently applied to Industry 4.0. She is the author of a database book and has more than 60 works published in international journals, chapters and conferences. She is an active reviewer for several international journals and conferences, such as Expert Systems with Applications, Decision Support Systems, International Journal of Information Technology & Decision Making, and IEEE Transactions on Human–Machine Systems.

Pablo Sánchez is an Assistant Professor at the University of Cantabria. He has worked in different research topics, such as aspect-oriented software development, model-driven development, software product lines or domain-specific languages. He has been an active member of several European research projects. His work can be found at conferences like MODELS, ECMDA, ECSA, or SLE and international journals such as Information and Software Technology, Journal of Object Technology, Computer Journal, or Computer Languages, Systems and Structures. He has also been involved in events like the MoDRE workshop series.
