Lavoisier: High-Level Selection and Preparation of Data for Analysis

de la Vega, Alfonso; García-Saiz, Diego; Zorrilla, Marta; Sánchez, Pablo

doi:10.1007/978-3-030-32065-2_4


Conference paper
First Online: 21 October 2019

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11815)

Abstract

Most data mining algorithms require their input data in a very specific tabular format. Data scientists typically produce this format by writing long and complex scripts in data management languages and libraries such as SQL, R or Pandas, in which they chain low-level data transformation operations. Writing these scripts is time-consuming and error-prone, which decreases data scientists’ productivity. To overcome this limitation, we present Lavoisier, a declarative language for data extraction and formatting. The language provides a set of high-level constructs that let data scientists abstract away from low-level data formatting operations. Consequently, data extraction scripts become smaller and simpler, which increases productivity with respect to conventional data manipulation tools.
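
To make the kind of low-level script mentioned in the abstract concrete, the following is a minimal sketch in plain Python. It is not Lavoisier code and is not taken from the paper: the nested `businesses` data, the `flatten` function and the `avg_stars_<year>` column names are all hypothetical, chosen only to illustrate the group, aggregate and pivot steps that a data scientist would otherwise write by hand to obtain one flat row per entity.

```python
from collections import defaultdict

# Hypothetical nested domain data (an assumption, not from the paper):
# each business owns a list of reviews.
businesses = [
    {"id": "b1", "name": "Cafe Uno", "reviews": [
        {"year": 2017, "stars": 4},
        {"year": 2018, "stars": 5},
        {"year": 2018, "stars": 3},
    ]},
    {"id": "b2", "name": "Bar Dos", "reviews": [
        {"year": 2018, "stars": 2},
    ]},
]


def flatten(businesses):
    """Return one flat row (dict) per business, with one column per year."""
    # Step 1: collect every review year so that all rows share the same columns.
    years = sorted({r["year"] for b in businesses for r in b["reviews"]})
    rows = []
    for b in businesses:
        # Step 2: group the nested reviews of this business by year.
        by_year = defaultdict(list)
        for r in b["reviews"]:
            by_year[r["year"]].append(r["stars"])
        # Step 3: aggregate each group and pivot it into its own column.
        row = {"id": b["id"], "name": b["name"]}
        for y in years:
            stars = by_year.get(y)
            row[f"avg_stars_{y}"] = sum(stars) / len(stars) if stars else None
        rows.append(row)
    return rows


table = flatten(businesses)
for row in table:
    print(row)
```

Even for this toy input, the script has to handle column alignment and missing values explicitly, which is the sort of detail a declarative language for data extraction aims to hide.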

Notes

1. https://www.yelp.com/dataset/challenge
2. The Features inheritance of Fig. 1 has been omitted from this initial example.
3. https://github.com/alfonsodelavega/lavoisier
4. https://github.com/alfonsodelavega/lavoisier-evaluation

Acknowledgements

Funded by the University of Cantabria’s Doctorate Program, and by the Spanish Government under grant TIN2017-86520-C3-3-R.

Author information

Authors and Affiliations

Software Engineering and Real-Time, University of Cantabria, Santander, Spain
Alfonso de la Vega, Diego García-Saiz, Marta Zorrilla & Pablo Sánchez

Corresponding author

Correspondence to Alfonso de la Vega.

Editor information

Editors and Affiliations

UIUC Institute, Zhejiang University, Zhejiang, China
Klaus-Dieter Schewe
INPT-ENSEEIHT/IRIT, Toulouse, France
Neeraj Kumar Singh

About this paper

Cite this paper

de la Vega, A., García-Saiz, D., Zorrilla, M., Sánchez, P. (2019). Lavoisier: High-Level Selection and Preparation of Data for Analysis. In: Schewe, K.-D., Singh, N. (eds) Model and Data Engineering. MEDI 2019. Lecture Notes in Computer Science, vol 11815. Springer, Cham. https://doi.org/10.1007/978-3-030-32065-2_4

DOI: https://doi.org/10.1007/978-3-030-32065-2_4
Published: 21 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32064-5
Online ISBN: 978-3-030-32065-2
eBook Packages: Computer Science, Computer Science (R0)
