Lavoisier: A DSL for increasing the level of abstraction of data selection and formatting in data mining

https://doi.org/10.1016/j.cola.2020.100987

Abstract

Input data for a data mining algorithm must conform to a very specific tabular format. Data scientists arrange data into that format by writing long and complex scripts that chain many low-level operations, which is a time-consuming and error-prone process. To alleviate this situation, we present Lavoisier, a declarative language for data selection and formatting in a data mining context. Using Lavoisier, script size for data preparation can be reduced by 40% on average, and by up to 80% in some cases. Additionally, the accidental complexity present in state-of-the-art technologies is considerably mitigated.

Introduction

We live in a time in which data analysis techniques are becoming very popular, as they have proven beneficial for the success of organisations and projects. Examples exist in multiple domains, such as agriculture [1], (bio)medical areas [2], [3], system security [4], or solid-state materials research [5]. Despite this widespread usage, executing data mining processes still requires performing many low-level technical tasks, where explicit and fine-grained management of multiple details is mandatory [6]. As a result, the level of abstraction at which data scientists typically work is very low, which hinders their productivity.

Among these technical tasks, the most time-consuming one is typically the selection and preparation of data for an analysis [7]. Most data mining algorithms, such as those found in the Weka [8], KNIME [9] or scikit-learn [10] libraries, require their input data to be arranged in a very specific two-dimensional tabular format, where all the information related to each entity under analysis must be placed in a single row. For example, if we were analysing businesses by using information about sales, business providers and customer satisfaction, all this information, for each business, would have to be placed into the cells of a single row of the input table. This means that these algorithms cannot work with hierarchical or linked data such as JSON or XML files, or relational tables connected by means of foreign keys, which are common representations in which information is made available. Therefore, to execute a data mining algorithm, we first need to transform data stored in these representations into the specific tabular format that these algorithms can process.
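As a minimal sketch of the "one entity, one row" requirement, the following plain Python snippet flattens a small hierarchical structure into a table. The business names, the satisfaction scores, the yearly sales figures and the flatten helper are all hypothetical and serve only as an illustration; they are not taken from the case studies of the paper.

```python
# Hypothetical hierarchical input: each business nests its yearly sales.
businesses = [
    {"name": "Cafe Uno", "satisfaction": 4.5,
     "sales": [{"year": 2018, "amount": 120}, {"year": 2019, "amount": 150}]},
    {"name": "Bar Dos", "satisfaction": 3.8,
     "sales": [{"year": 2019, "amount": 90}]},
]


def flatten(records):
    """Return one flat row (a dict) per business, all with the same columns."""
    # Collect every year first, so that all rows share the same columns.
    years = sorted({s["year"] for r in records for s in r["sales"]})
    rows = []
    for r in records:
        row = {"name": r["name"], "satisfaction": r["satisfaction"]}
        by_year = {s["year"]: s["amount"] for s in r["sales"]}
        for year in years:
            # None marks a business with no sales recorded for that year.
            row[f"sales_{year}"] = by_year.get(year)
        rows.append(row)
    return rows


table = flatten(businesses)
for row in table:
    print(row)
```

Even in this small case, the nested sales list has to be spread over one column per year, and missing values have to be filled explicitly so that every row has the same shape.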

Data scientists perform this data transformation process by creating long and complex scripts, written in data management languages such as SQL (Structured Query Language) [11], R [12], or Pandas [13] (i.e. a well-known Python data manipulation library). These scripts extract data from the available sources and, through a set of low-level operations, such as joins [14] or pivots [15], [16], arrange these data as a tabular dataset that fulfils the "one entity, one row" constraint described above. Writing these scripts, which is a crucial step for the outcome of any analysis [17], can be a tedious, time-consuming and error-prone process.

To alleviate this situation, we present Lavoisier, a language that aims to automate some of the data management tasks that data scientists need to perform when building datasets. To automate these tasks, Lavoisier provides a set of declarative constructs that focus on specifying what information, among that available in a certain domain, should be included in a concrete analysis. These constructs are automatically processed by the language interpreter through different chains of data transformation operations, such as joins and pivots. Therefore, using Lavoisier, data scientists can focus on specifying what data must be selected for a certain analysis, and forget about the details of how the selected data must be transformed to form a dataset, which contributes to increasing their productivity.

For instance, let A and B be entities of a domain, where instances of A can refer to several B instances through a bs relationship (A --bs--> B), and each B instance has a b_id identifier attribute. To generate a properly formatted dataset of A instances, including the information of bs, the expression in Lavoisier would be mainclass A include bs by b_id. By contrast, achieving the same result using SQL or Pandas would require an expression that is 2 to 3 times longer and more complex. Specifically, we would need to perform a left join between A and B, followed by a pivot. Moreover, some extra fine-grained operations might also be required, for instance, to avoid name collisions between attributes of the combined entities. This scenario is detailed in Section 3.4, using concrete entities from a business reviews domain.
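The join-then-pivot chain described above can be sketched in SQL, run here through Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table layout, the column names (a_id, name, value), the sample rows and the bs_ prefix are hypothetical, and this is not the Section 3.4 example from the paper.

```python
import sqlite3

# Hypothetical tables standing in for the entities A and B from the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (a_id INTEGER, b_id TEXT, value REAL);
    INSERT INTO a VALUES (1, 'first'), (2, 'second'), (3, 'third');
    INSERT INTO b VALUES (1, 'x', 4.0), (1, 'y', 5.0), (2, 'x', 3.0);
""")

# The left join keeps A instances that have no related B instance.
# The CASE expressions pivot the B rows into one column per b_id value,
# and the bs_ prefix in the aliases avoids name collisions with A's columns.
query = """
    SELECT a.a_id, a.name,
           MAX(CASE WHEN b.b_id = 'x' THEN b.value END) AS bs_x_value,
           MAX(CASE WHEN b.b_id = 'y' THEN b.value END) AS bs_y_value
    FROM a
    LEFT JOIN b ON b.a_id = a.a_id
    GROUP BY a.a_id, a.name
    ORDER BY a.a_id
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

Note that the pivot columns have to be written out by hand for each b_id value, and the grouping and aliasing details are left to the script author. These are the kind of low-level concerns that the single Lavoisier expression is meant to hide.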

The expressiveness and effectiveness of Lavoisier were assessed through a comparison against the two technologies mentioned in the previous paragraph, which are currently very popular for data manipulation: SQL [11] and Pandas [13]. For the comparison, a comprehensive set of data selection and preparation scenarios was first devised, using two different case studies. Then, for each scenario, we compared the corresponding Lavoisier specification with its SQL and Pandas counterparts. From this comparison, we concluded that Lavoisier's dataset specifications are more compact, less verbose and allow working at a higher abstraction level. On average, Lavoisier reduces script size by 60% with respect to SQL and by 40% with respect to Pandas, and by up to 80% in some cases. This script size reduction is mainly due to Lavoisier's dataset specifications requiring, on average, 40% fewer operations and 70% fewer parameters than their SQL and Pandas counterparts.

This paper updates and extends a previous contribution presented at the 9th International Conference on Model and Data Engineering (MEDI) [18]. With respect to that contribution, we include:

  • a more detailed context and problem statement in Section 2,

  • a revised and extended description by example of the language in Section 3,

  • a description of the implementation, which was not included in the conference version of this paper, and which presents the internal structure of Lavoisier in Section 4, and

  • an extended and more rigorous evaluation of our work in Section 5, where more extraction scenarios and an additional case study have been included.

After the evaluation, Section 6 comments on related work and, finally, Section 7 summarises this article and outlines future work.

Section snippets

Case study and problem statement

This section describes in more detail the motivation behind this work. To illustrate it, we use the Yelp Dataset Challenge, which is introduced next.

Lavoisier: Dataset extraction language

Since Lavoisier aims to be a language for selecting a subset of all the data available in a domain, we need a mechanism to describe these data. This mechanism should be a high-level notation, such as a conceptual modelling notation, that allows us to focus on domain data and their relationships, and avoids technical issues about how these data are stored. From the different candidate notations, we selected object-oriented models, such as the one in Fig. 1, since this technique is widely used nowadays

Implementation

This section covers the relevant aspects of the implementation of Lavoisier. Specifically, we describe the main language components, and the steps through which a Lavoisier script is processed to generate output datasets. This implementation has been open-sourced in an external repository.

We defined and implemented Lavoisier from scratch, instead of opting for extending any existing language. The reasoning behind this decision is the fact

Evaluation

Lavoisier aims to increase the abstraction level at which data scientists work when creating datasets. To achieve this goal, Lavoisier provides different high-level primitives that, when processed, automatically execute a set of low-level operations that rearrange domain data into a tabular form that data mining algorithms can process. This provides two main advantages: (1) data scientists do not have to write by hand boilerplate code containing long chains of data transformation operations;

Related work

To the best of our knowledge, this is the first language designed to select data from domain models and generate tabular datasets from them. Currently, dataset extraction processes are usually performed by using SQL-like languages [11]; frameworks for data management that typically include their own languages, such as the R project [12] or Julia [37]; or libraries developed for general purpose programming languages, e.g., the Pandas library for Python [13] or Weka for Java [8]. As we have seen

Summary and future work

This work has presented Lavoisier, a language for assisting data scientists during the creation of datasets according to the format accepted by data mining algorithms. We started by presenting the data selection and transformation problem, which states that data mining algorithms can only receive data arranged in a specific tabular format. Therefore, before executing a data mining algorithm, we need to select and rearrange any hierarchical and linked domain data of interest for the analysis

CRediT authorship contribution statement

Alfonso de la Vega: Conceptualization, Methodology, Software, Validation, Writing - original draft, Writing - review & editing, Visualization. Diego García-Saiz: Conceptualization, Writing - review & editing. Marta Zorrilla: Writing - review & editing. Pablo Sánchez: Conceptualization, Methodology, Writing - original draft, Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work has been funded by the Spanish Government under grant TIN2017-86520-C3-3-R. Some icons in Fig. 2 and Fig. 8 were created by Smartline from Flaticon.

References (51)

  • Peng C. et al., Delivery of agricultural drought information via web services, Earth Sci. Inform. (2015)
  • Schmidt J. et al., Recent advances and applications of machine learning in solid-state materials science, npj Comput. Mater. (2019)
  • Witten I.H. et al., Data Mining: Practical Machine Learning Tools and Techniques (2016)
  • Munson M.A., A study on the importance of and time spent on different modeling steps, SIGKDD Explor. Newsl. (2012)
  • Hall M. et al., The WEKA data mining software: An update, SIGKDD Explor. Newsl. (2009)
  • Berthold M.R. et al., KNIME - the Konstanz Information Miner: version 2.0 and beyond, SIGKDD Explor. (2009)
  • Pedregosa F. et al., Scikit-learn: Machine learning in Python, J. Mach. Learn. Res. (2011)
  • Beighley L., Head First SQL (2007)
  • R, The R Project for Statistical Computing,...
  • W. McKinney, Data structures for statistical computing in Python, in: Proceedings of the 9th Python in Science...
  • Codd E.F., A relational model of data for large shared data banks, Commun. ACM (1970)
  • Wyss C.M. et al., A formal characterization of PIVOT/UNPIVOT
  • de la Vega A. et al., Lavoisier: High-level selection and preparation of data for analysis
  • Fayyad U. et al., From data mining to knowledge discovery in databases, AI Mag. (1996)
  • Hartmann T. et al., The next evolution of MDE: a seamless integration of machine learning into domain modeling, Softw. Syst. Model. (2019)