Mortadelo: Automatic generation of NoSQL stores from platform-independent data models

https://doi.org/10.1016/j.future.2019.11.032

Highlights

  • A NoSQL design process based on a paradigm-independent conceptual data model.

  • Physical database models are generated through model transformation techniques.

  • Physical database models can be obtained for column family or document databases.

  • The generation process can be configured with technology-independent annotations.

  • Design quality is not compromised to achieve genericity.

Abstract

In the last decade, several NoSQL systems have emerged as a response to the scalability problems manifested by classical relational databases when used in Big Data contexts. These NoSQL systems appeared first as physical-level solutions, initially lacking any design methodologies. After this initial batch of systems, several design methodologies for NoSQL have recently been created. Nevertheless, most of these methodologies target just one NoSQL paradigm. In addition, as each methodology uses a different conceptual modeling approach, NoSQL database designers need to remake conceptual models when they switch from one NoSQL paradigm to another. Moreover, most of these design processes provide just a set of design heuristics and guidelines that database designers need to apply manually, which can be a time-consuming and error-prone process. To overcome these limitations, this article presents Mortadelo, a model-driven NoSQL database design process in which an implementation for a concrete NoSQL database system can be automatically generated from a high-level conceptual model that is independent of any specific NoSQL paradigm. Moreover, this database generation process can be customized, so that design trade-offs can be managed differently according to the needs of each context. We evaluated Mortadelo’s capabilities by generating database implementations for several typical NoSQL case studies. In these cases, Mortadelo was able to generate implementations for the Cassandra and MongoDB NoSQL systems from the same conceptual data model. These implementations were similar to the ones produced by design methodologies specifically developed for a single paradigm. Therefore, our approach does not sacrifice design quality in favor of generality.

Introduction

Nowadays, several kinds of software systems have pushed relational databases to their limits. Examples of these new kinds of applications are internet-scale applications, such as Twitter or Amazon; Internet of Things (IoT) applications, such as Smart Cities [1], [2]; Industry 4.0 systems [3], [4]; or Big Data systems [5], [6]. All these systems need to manage large volumes of data that are often distributed across several servers, and they must ensure low response times and high availability under a high number of concurrent requests. In these scenarios, relational databases have manifested different scalability problems [7].

In response to these limitations, a new generation of database management systems, denoted as NoSQL systems [8], started to offer some alternatives. Each of these alternatives was designed for a specific purpose and follows a different approach. Thus, NoSQL is not a single alternative to relational databases, but an umbrella term that comprises different database strategies, including, among others, document-oriented databases [9], [10], key–value stores [11], [12] and column family stores [13], [14]. Despite their differences, most NoSQL database systems rely on two common features: (1) they make use of data denormalisation to improve response times [15], and (2) they sacrifice some ACID (Atomicity, Consistency, Isolation, and Durability) properties [7], [16] to increase scalability, offering instead less restrictive, best-effort guarantees that are still useful, such as eventual consistency [17].

These NoSQL technologies emerged first at the implementation level and, consequently, they initially lacked well-defined design processes. Database design methodologies for relational databases [18], which are usually based on conceptual modeling notations such as ER (Entity-Relationship) [19] or UML (Unified Modeling Language) [20], soon proved insufficient for designing NoSQL databases. To take advantage of the benefits provided by data nesting and denormalisation, database designers need to take into account not only which data will be stored in the database, but also how these data will be accessed [21], [22]. In NoSQL systems, working with the same set of data, but with different data access patterns, might lead to different database implementations. This is because, in many NoSQL systems, design decisions are driven by how data will be accessed. Traditional database design approaches do not provide adequate support for these issues, mainly because they were created to satisfy other goals, e.g., the aforementioned ACID properties [16].
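
The effect of access patterns on the stored layout can be illustrated with a minimal Python sketch. It is not taken from the article: the order records, the field names and the helper `denormalize_by` are hypothetical, and plain dictionaries stand in for a NoSQL store. The same data set is laid out twice, once per access pattern, so that each query becomes a single key lookup.

```python
from collections import defaultdict

# Hypothetical data set: three order records (not from the article).
orders = [
    {"order_id": 1, "customer": "ana", "product": "lamp", "price": 30},
    {"order_id": 2, "customer": "ana", "product": "desk", "price": 120},
    {"order_id": 3, "customer": "luis", "product": "lamp", "price": 30},
]


def denormalize_by(records, key):
    """Group a copy of every record under the value of `key`.

    Each access pattern ("orders of a customer", "orders of a product")
    gets its own structure, so the data is duplicated across structures.
    """
    grouped = defaultdict(list)
    for record in records:
        # Store the remaining fields; the key itself becomes the lookup key.
        grouped[record[key]].append(
            {field: value for field, value in record.items() if field != key}
        )
    return dict(grouped)


# Same data, two access patterns, two different physical layouts.
orders_by_customer = denormalize_by(orders, "customer")
orders_by_product = denormalize_by(orders, "product")

print(orders_by_customer["ana"])
print(orders_by_product["lamp"])
```

Each structure answers exactly one query efficiently, at the price of storing every order twice. This is the kind of trade-off that a purely structural conceptual model cannot express.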

To address this gap, several design methodologies for NoSQL systems have been created in recent years [21], [22], [23], [24]. Nevertheless, these approaches still present some limitations, which can be summarized as follows:

  1. Each one of these approaches focuses on a concrete NoSQL paradigm, providing its own conceptual modeling languages and notations. This implies that the same conceptual data model cannot be used to describe the same database in different NoSQL paradigms.

  2. Most approaches describe how to design a NoSQL database by means of guidelines or heuristics that must be interpreted and applied manually by database designers. This can be an error-prone and time-consuming process. Just two approaches [21], [24] address design process automation and provide the basis for building CASE tools.

  3. Those approaches that tackle automation often use the same strategy to transform the patterns they found at the conceptual level into constructs of the target database. Therefore, they neglect the existence of alternative strategies that might be more adequate in certain contexts, or when targeting different NoSQL paradigms, e.g., document-based or column family stores.

To overcome these limitations, we present Mortadelo, a model-driven development process for NoSQL database design. This process builds on previous work and goes one step further by being able to automatically generate implementations for different NoSQL paradigms from the same conceptual model. The generation process is achieved by means of model transformation and code generation techniques. Currently, we have created and implemented model transformation rules for supporting the generation of column family stores and document databases, but the framework could be extended to support other paradigms, such as key–value stores.
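
As a rough illustration of generating implementations for two paradigms from one conceptual model, consider the following Python sketch. It does not reproduce Mortadelo's metamodel or transformation rules: the classes `Entity` and `Query`, the type mapping `CQL_TYPES` and the functions `to_cassandra` and `to_mongodb` are hypothetical names, and the mapping strategy (filter attributes become the partition key or index keys) is an assumption chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    attributes: dict  # attribute name -> abstract type ("id", "text", "number")


@dataclass
class Query:
    name: str
    entity: Entity
    filter_by: list  # attributes the query filters on
    order_by: list   # attributes the results are sorted by


# Assumed mapping from abstract types to CQL types.
CQL_TYPES = {"id": "uuid", "text": "text", "number": "double"}


def to_cassandra(query):
    """Emit one CQL table per query (column family target)."""
    columns = ",\n".join(
        f"  {attr} {CQL_TYPES[kind]}"
        for attr, kind in query.entity.attributes.items()
    )
    partition = ", ".join(query.filter_by)
    clustering = "".join(f", {attr}" for attr in query.order_by)
    return (
        f"CREATE TABLE {query.name} (\n{columns},\n"
        f"  PRIMARY KEY (({partition}){clustering})\n);"
    )


def to_mongodb(query):
    """Emit shell commands for a collection plus a supporting index."""
    collection = query.entity.name.lower()
    keys = ", ".join(f"{attr}: 1" for attr in query.filter_by + query.order_by)
    return (
        f'db.createCollection("{collection}");\n'
        f"db.{collection}.createIndex({{ {keys} }});"
    )


# Hypothetical conceptual model: one entity and one access pattern.
product = Entity(
    "Product",
    {"product_id": "id", "name": "text", "category": "text", "price": "number"},
)
by_category = Query(
    "products_by_category", product, ["category"], ["price", "product_id"]
)

print(to_cassandra(by_category))
print(to_mongodb(by_category))
```

The point of the sketch is only that a single platform-independent description, consisting of structure plus access patterns, can feed two separate generators. A real transformation chain also has to handle relationships between entities, nesting and the configurable design trade-offs mentioned in the text.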

To evaluate the expressiveness and effectiveness of our approach, we used Mortadelo to model different case studies used as test-beds in the NoSQL literature. We compared the generated NoSQL databases with the databases obtained with state-of-the-art NoSQL design methodologies. The results of this evaluation showed that, using Mortadelo, the same conceptual model can be transformed into either a column family database, implemented in Cassandra [25], or a document database, implemented in MongoDB [9]. Moreover, the obtained databases were very similar to those generated by design methodologies devised specifically for one NoSQL paradigm. In some cases, our approach performed even better, whereas in one case our designs might not be as good as the ones generated by other approaches. In addition, our approach offers several transformation alternatives, so the same conceptual model might be handled differently depending on each concrete context. This feature is scarcely supported by NoSQL design methodologies.

The remainder of this article is structured as follows. Section 2 presents the running example used throughout the paper and introduces the NoSQL technologies used. Next, in Section 3, related work is discussed. In Section 4, we detail the different phases of the transformation process followed by Mortadelo to generate NoSQL databases. Section 5 includes the evaluation of Mortadelo. Lastly, we present our conclusions and future work in Section 6.

Background

To make this article self-contained, this section provides some background on the technologies used, i.e., column family and document data stores. Before describing these technologies, we introduce the running example that we use to illustrate the different concepts that appear in this work.

Related work

As NoSQL systems emerged, different approaches addressing the design problems of these systems were created. These approaches are summarized in Table 1. We briefly describe each one of these approaches.

Li [23] presented one of the very first works on NoSQL database design. They proposed a set of high-level heuristics for refactoring relational databases into HBase ones. To produce a NoSQL database using this work, we would need to create a relational database first, and then to transform it to

Solution description

We start by describing the general components of the transformation process defined by Mortadelo. Then, successive sections describe these components with more detail.

Evaluation and discussion

As stated in the introduction, the main goal of Mortadelo is to automatically generate databases for different NoSQL paradigms from the same high-level data model. The previous section has shown how this goal is satisfied for the running example. This section analyzes how Mortadelo works in other case studies, providing evidence of its applicability. To evaluate whether our work can be used in different settings, we analyzed it from the following four perspectives: (EI-1)

Summary and future work

This work has presented Mortadelo, a model-driven design process for the generation of NoSQL databases. The main contribution of Mortadelo, when compared with other state-of-the-art approaches, is that, from the same conceptual data model, Mortadelo is able to generate implementations for different NoSQL technologies, such as column family or document-based ones. To the best of our knowledge, this is the first NoSQL database design methodology with these characteristics. To generate these NoSQL

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to express their gratitude to the software engineers from Soincon (http://soincon.es/) for their collaboration during the evaluation of Mortadelo. This work has been partially funded by the Government of Cantabria (Spain) under the doctoral studentship program from the University of Cantabria, Spain, and by the Spanish Government under grant TIN2017-86520-C3-3-R.

Alfonso de la Vega is a postdoctoral researcher at the Software Engineering and Real-Time Group of the University of Cantabria. He received his Ph.D. degree from the same university in 2019. His current research focuses on incorporating the benefits of model-driven engineering and domain-specific languages into the data manipulation and data mining fields, with a special focus on reducing the complexity in conventional analysis processes.

References (46)

  • Anderson J.C. et al.

    CouchDB: The Definitive Guide: Time To Relax

    (2010)

  • DeCandia G. et al.

    Dynamo: Amazon’s highly available key–value store

  • Carlson J.L.

    Redis in Action

    (2013)

  • Hewitt E.

    Cassandra: The Definitive Guide

    (2010)

  • George L.

    HBase: The Definitive Guide: Random Access to Your Planet-Size Data

    (2011)

  • Vajk T. et al.

    Denormalizing data into schema-free databases

  • Haerder T. et al.

    Principles of Transaction-Oriented Database Recovery

    (1983)

  • Chandra D.G.

    BASE analysis of NoSQL database

    Future Gener. Comput. Syst.

    (2015)

  • Codd E.F.

    The Relational Model for Database Management: Version 2

    (1990)

  • Chen P.P.-S.

    The entity-relationship model–toward a unified view of data

    ACM Trans. Database Syst.

    (1976)

  • Li L. et al.

    UML specification and relational database

    J. Object Technol.

    (2003)

  • Chebotko A. et al.

    A big data modeling methodology for Apache Cassandra

    Int. Congr. Big Data

    (2015)

  • Mior M.J. et al.

    NoSE: Schema design for NoSQL applications

    IEEE Trans. Knowl. Data Eng.

    (2017)

Cited by (26)

    • MDICA: Maintenance of data integrity in column-oriented database applications

      2023, Computer Standards and Interfaces
      Citation Excerpt :

      Starting from a conceptual model, it is automatically transformed into a NoSQL schema [22] that can serve the queries with minimal cost [18], it is mapped to heterogeneous datastores [23], and MongoDB [17] and HBase [19] databases are designed. In [24], a tool is designed to generate implementations for Cassandra and MongoDB from the same conceptual data model. Chebotko et al. [20] and Mior et al. [21] focus their approach on generating logical and physical Cassandra models from the application's conceptual data model and supported queries, which have been leveraged for this work.

    • A workload-driven method for designing aggregate-oriented NoSQL databases

      2022, Data and Knowledge Engineering
      Citation Excerpt :

      Also, the NoAM block is an abstraction of NoSQL constructs such as CFs or DCs, and it may be problematic in identifying important features of the target data model, such as the partition and clustering keys of a CF schema. Mortadelo [33] proposes a Generic Data Metamodel (GDM) for conceptual model and two NoSQL data models (i.e., CF and DC) as well as the query workload. Highly Used (HU) entity types are used to control the denormalization levels of entities.

    • The central role of data repositories and data models in Data Science and Advanced Analytics

      2022, Future Generation Computer Systems
      Citation Excerpt :

      These findings are illustrated through the leader election protocol. The third paper, titled: “Mortadelo: Automatic Generation of NoSQL Stores from Platform-Independent Data Models” [20], by Alfonso de la Vega, García-Saiz, Carlos Blanco, Marta Zorrilla, and Pablo Sánchez, proposes a model-driven NoSQL database design process. Mortadelo starts from a high-level conceptual model that is independent of any specific NoSQL paradigm (column family or document-based ones) and produces the data structures for a concrete target NoSQL database system.

    • Managing polyglot systems metadata with hypergraphs

      2021, Data and Knowledge Engineering
      Citation Excerpt :

      In cases where an entity can become a part of several different hyper nodes, it is replicated in each of them. Mortadelo [30] introduces a model-driven database design process to automatically generate a concrete NoSQL database system from a high-level conceptual model. The platform-independent Generic Data Metamodel (GDM) is used to represent not only structural data but also the data access patterns.

    Diego García-Saiz is an Assistant Professor at the University of Cantabria. He obtained his Ph.D. in Computer Science in 2016. His main research lines are in the Data Science and Software Engineering fields, having several publications in high impact journals, conferences and book chapters.

    Carlos Blanco has a Ph.D. in Computer Science from the University of Castilla-La Mancha (Spain). He works as a lecturer at the Science Faculty of the University of Cantabria (Spain) and is a member of several research groups: GSyA (University of Castilla-La Mancha) and ISTR (University of Cantabria). His research activity is in the field of Security for Information Systems and is especially focused on securing Big Data, Data Warehouses and OLAP systems by using MDE approaches. He has published several international communications, papers and book chapters related to these topics (DSS, CSI, INFSOF, ComSIS, TCJ, ER, DaWaK, etc.). He is involved in the organization of several international workshops (WOSIS, WISSE, MoBiD) and has served as a reviewer for international journals, conferences and workshops (INFSOF, CSI, DSS, TCJ, ARES, ER, DaWaK, SECRYPT, etc.).

    Marta Zorrilla is an Associate Professor in Computer Science within the Software Engineering and Real Time Group at the University of Cantabria (Spain). She has been involved in several national and European research projects together with other international research institutions. Her research interests are database technologies, data mining and big data, currently applied to Industry 4.0. She is the author of a database book and has more than 60 works published in international journals, chapters and conferences. She is an active reviewer of several international journals and conferences such as Expert Systems with Applications, Decision Support Systems, International Journal of Information Technology & Decision Making, IEEE Transactions on Human–Machine Systems, among others.

    Pablo Sánchez is an Assistant Professor at the University of Cantabria. He has worked in different research topics, such as aspect-oriented software development, model-driven development, software product lines or domain-specific languages. He has been an active member of several European research projects. His work can be found at conferences like MODELS, ECMDA, ECSA, or SLE and international journals such as Information and Software Technology, Journal of Object Technology, Computer Journal, or Computer Languages, Systems and Structures. He has also been involved in events like the MoDRE workshop series.
