Lavoisier: A DSL for increasing the level of abstraction of data selection and formatting in data mining
Introduction
We live in a time in which data analysis techniques are becoming very popular, as they have proven beneficial for the success of organisations and projects. Examples exist in multiple domains, such as agriculture [1], (bio)medical areas [2], [3], system security [4], or solid-state materials research [5]. Despite this widespread usage, executing data mining processes still requires many low-level technical tasks, in which explicit and fine-grained management of multiple details is mandatory [6]. As a result, the level of abstraction at which data scientists typically work is very low, which hinders their productivity.
Among these technical tasks, the most time-consuming one is typically the selection and preparation of data for an analysis [7]. Most data mining algorithms, such as those found in the Weka [8], KNIME [9] or scikit-learn [10] libraries, require their input data to be arranged in a very specific two-dimensional tabular format, where all the information related to each entity under analysis must be placed in a single row. For example, if we were analysing businesses using information about sales, business providers and customer satisfaction, all this information, for each business, would have to be placed into the cells of a single row of the input table. This means that these algorithms cannot work directly with hierarchical or linked data such as JSON or XML files, or relational tables connected by means of foreign keys, which are common representations in which information is made available. Therefore, to execute a data mining algorithm, we first need to transform data stored in these representations into the specific tabular format that these algorithms can process.
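The one-row-per-entity requirement can be illustrated with a minimal Python sketch. The record layout, the field names (`providers`, `sales`, `city`, `amount`) and the choice to summarise sales as a count and a total are assumptions made for illustration only; they are not taken from the paper's case studies.

```python
from typing import Any


def flatten_entity(business: dict[str, Any]) -> dict[str, Any]:
    """Place all information about one (hypothetical) business into one flat row."""
    row: dict[str, Any] = {
        "business_id": business["business_id"],
        "name": business["name"],
    }
    # Linked providers: one column per provider, named after its identifier.
    for provider in business.get("providers", []):
        row[f"provider_{provider['provider_id']}_city"] = provider["city"]
    # Linked sales: summarised into fixed columns (an illustrative choice).
    sales = business.get("sales", [])
    row["sales_count"] = len(sales)
    row["sales_total"] = sum(sale["amount"] for sale in sales)
    return row


# Hierarchical input, as it might arrive from a JSON file.
businesses = [
    {
        "business_id": 1,
        "name": "Cafe Uno",
        "providers": [
            {"provider_id": "p1", "city": "Santander"},
            {"provider_id": "p2", "city": "Bilbao"},
        ],
        "sales": [{"amount": 10.0}, {"amount": 5.5}],
    },
    {
        "business_id": 2,
        "name": "Bar Dos",
        "providers": [{"provider_id": "p1", "city": "Santander"}],
        "sales": [],
    },
]

rows = [flatten_entity(b) for b in businesses]

# Every row must expose the same columns, so missing cells are filled with None.
columns = sorted({column for row in rows for column in row})
table = [{column: row.get(column) for column in columns} for row in rows]
```

Even in this small case, the flattening code has to decide how to name the generated columns and how to fill the cells of entities that lack a given related item.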
Data scientists perform this data transformation process by creating long and complex scripts, written in data management languages such as SQL (Structured Query Language) [11], R [12], or Pandas [13], a well-known Python data manipulation library. These scripts extract data from the available sources and, through a set of low-level operations, such as joins [14] or pivots [15], [16], arrange these data as a tabular dataset that fulfils the one-entity, one-row constraint described above. Writing these scripts, which is a crucial step for the outcome of any analysis [17], can be a tedious, time-consuming and error-prone process.
To alleviate this situation, we present Lavoisier, a language that aims to automate some of the data management tasks that data scientists need to perform when building datasets. To automate these tasks, Lavoisier provides a set of declarative constructs that focus on specifying what information, among all that is available in a certain domain, should be included in a concrete analysis. These constructs are automatically processed by the language interpreter through different chains of data transformation operations, such as joins and pivots. Therefore, using Lavoisier, data scientists can focus on specifying what data must be selected for a certain analysis, and forget about the details of how the selected data must be transformed to form a dataset, which helps to increase their productivity.
For instance, let A and B be entities of a domain, where instances of A can refer to several B instances through a bs relationship, and each B instance has a b_id identifier attribute. To generate a properly formatted dataset of A instances, including the information of bs, the expression in Lavoisier would be mainclass A include bs by b_id. In contrast, achieving the same result using SQL or Pandas would require an expression that is 2 to 3 times longer and more complex. Specifically, we would need to perform a left join between A and B, followed by a pivot. Moreover, some extra fine-grained operations might also be required, for instance, to avoid name collisions between attributes of the combined entities. This scenario is detailed in Section 3.4, using concrete entities from a business reviews domain.
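As a rough illustration of the left join followed by a pivot, the sketch below runs the SQL variant through Python's standard `sqlite3` module. The table layout, the columns `a_id`, `a_name` and `value`, and the `bs_` column prefix are hypothetical; this is not the script used in the paper's comparison. Because SQLite has no built-in pivot operator, the pivot is emulated with one conditional aggregate per `b_id` value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (a_id INTEGER PRIMARY KEY, a_name TEXT);
    CREATE TABLE b (b_id TEXT, a_id INTEGER REFERENCES a(a_id), value REAL);
    INSERT INTO a VALUES (1, 'first'), (2, 'second'), (3, 'third');
    INSERT INTO b VALUES ('x', 1, 1.5), ('y', 1, 2.5), ('x', 2, 4.0);
    """
)

# Step 1: discover which b_id values exist; each one becomes an output column.
b_ids = [r[0] for r in conn.execute("SELECT DISTINCT b_id FROM b ORDER BY b_id")]

# Step 2: emulate the pivot with one conditional aggregate per b_id.
# The "bs_" prefix avoids name collisions with the columns of A.
# The identifiers are trusted here because they come from our own test data.
pivot_columns = ", ".join(
    f'MAX(CASE WHEN b.b_id = ? THEN b.value END) AS "bs_{b_id}_value"'
    for b_id in b_ids
)

# Step 3: the left join keeps A instances without related B instances.
query = f"""
    SELECT a.a_id AS a_id, a.a_name AS a_name, {pivot_columns}
    FROM a LEFT JOIN b ON b.a_id = a.a_id
    GROUP BY a.a_id, a.a_name
    ORDER BY a.a_id
"""
cursor = conn.execute(query, b_ids)
header = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(header)
for row in rows:
    print(row)
```

The column discovery, the collision-avoiding prefix and the grouping all have to be written by hand here, which is the kind of detail that the single Lavoisier expression above leaves to the interpreter.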
The expressiveness and effectiveness of Lavoisier were assessed by a comparison against the two technologies mentioned in the previous paragraph, which are currently very popular for data manipulation: SQL [11] and Pandas [13]. For the comparison, a comprehensive set of data selection and preparation scenarios was first devised, using two different case studies. Then, for each scenario, we compared the Lavoisier specification with its SQL and Pandas counterparts. From this comparison, we concluded that Lavoisier’s dataset specifications are more compact, less verbose and allow working at a higher abstraction level. In general, using Lavoisier reduces script size by 60% and 40% on average with respect to SQL and Pandas, respectively, and by up to 80% in some cases. This reduction is mainly due to Lavoisier’s dataset specifications requiring 40% fewer operations on average, and 70% fewer parameters, than their SQL and Pandas counterparts.
This paper updates and extends a previous contribution presented at the 9th International Conference on Model and Data Engineering (MEDI) [18]. Compared with that contribution, this paper includes:
- a more detailed context and problem statement in Section 2,
- a revised and extended description by example of the language in Section 3,
- a description of the implementation in Section 4, which was not included in the conference version of this paper and which presents the internal structure of Lavoisier, and
- an extended and more rigorous evaluation of our work in Section 5, where more extraction scenarios and an additional case study have been included.
After the evaluation, Section 6 comments on related work and, finally, Section 7 summarises this article and outlines future work.
Section snippets
Case study and problem statement
This section describes in more detail the motivation behind this work. To illustrate it, we use the Yelp Dataset Challenge, which is introduced next.
Lavoisier: Dataset extraction language
Since Lavoisier aims to be a language for selecting a subset of all data available in a domain, we need a mechanism to describe these data. This mechanism should be a high-level notation, such as a conceptual modelling notation, that allows us to focus on domain data and their relationships, and avoids technical issues about how these data are stored. From the different candidate notations, we selected object-oriented models, such as the one in Fig. 1, since this technique is widely used nowadays
Implementation
This section covers the relevant aspects of the implementation of Lavoisier. Specifically, we describe the main language components, and the steps through which a Lavoisier script is processed to generate output datasets. This implementation has been open-sourced in an external repository.³
We defined and implemented Lavoisier from scratch, instead of opting for extending any existing language. The reasoning behind this decision is the fact
Evaluation
Lavoisier aims to increase the abstraction level at which data scientists work when creating datasets. To achieve this goal, Lavoisier provides different high-level primitives that, when processed, automatically execute a set of low-level operations that rearrange domain data into a tabular form that data mining algorithms can process. This provides two main advantages: (1) data scientists do not have to write by hand boilerplate code containing long chains of data transformation operations;
Related work
To the best of our knowledge, this is the first language designed to select data from domain models and generate tabular datasets from them. Currently, dataset extraction processes are usually performed by using SQL-like languages [11]; frameworks for data management that typically include their own languages, such as the R project [12] or Julia [37]; or libraries developed for general purpose programming languages, e.g., the Pandas library for Python [13] or Weka for Java [8]. As we have seen
Summary and future work
This work has presented Lavoisier, a language for assisting data scientists during the creation of datasets according to the format accepted by data mining algorithms. We started by presenting the data selection and transformation problem, which states that data mining algorithms can only receive data arranged in a specific tabular format. Therefore, before executing a data mining algorithm, we need to select and rearrange any hierarchical and linked domain data of interest for the analysis
CRediT authorship contribution statement
Alfonso de la Vega: Conceptualization, Methodology, Software, Validation, Writing - original draft, Writing - review & editing, Visualization. Diego García-Saiz: Conceptualization, Writing - review & editing. Marta Zorrilla: Writing - review & editing. Pablo Sánchez: Conceptualization, Methodology, Writing - original draft, Writing - review & editing, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work has been funded by the Spanish Government under grant TIN2017-86520-C3-3-R. Some icons in Fig. 2 and Fig. 8 were created by Smartline from Flaticon.
References (51)
- et al. A data mining system for providing analytical information on brain tumors to public health decision makers. Comput. Methods Programs Biomed. (2013)
- et al. ReVeaLD: A user-driven domain-specific interactive search platform for biomedical research. J. Biomed. Inform. (2014)
- et al. Discovering and utilising expert knowledge from security event logs. J. Inf. Secur. Appl. (2019)
- et al. PIVOT and UNPIVOT: optimization and execution strategies in an RDBMS.
- et al. The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing. European J. Oper. Res. (2006)
- et al. A tutorial on metamodelling for grammar researchers. Sci. Comput. Progr. (2014)
- et al. Usability driven DSL development with USE-ME. Comput. Lang. Syst. Struct. (2018)
- et al. FLANDM: a development framework of domain-specific languages for data mining democratisation. Comput. Lang. Syst. Struct. (2018)
- et al. Domain-specific languages: A systematic mapping study. Inf. Softw. Technol. (2016)
- et al. Combining heterogeneous classifiers for relational databases. Pattern Recognit. (2013)
- Delivery of agricultural drought information via web services. Earth Sci. Inform.
- Recent advances and applications of machine learning in solid-state materials science. npj Comput. Mater.
- Data Mining: Practical Machine Learning Tools and Techniques.
- A study on the importance of and time spent on different modeling steps. SIGKDD Explor. Newsl.
- The weka data mining software: An update. SIGKDD Explor. Newsl.
- KNIME - the konstanz information miner: version 2.0 and beyond. SIGKDD Explor.
- Scikit-learn: Machine learning in Python. J. Mach. Learn. Res.
- Head First SQL.
- A relational model of data for large shared data banks. Commun. ACM
- A formal characterization of PIVOT/UNPIVOT.
- Lavoisier: High-level selection and preparation of data for analysis.
- From data mining to knowledge discovery in databases. AI Mag.
- The next evolution of MDE: a seamless integration of machine learning into domain modeling. Softw. Syst. Model.