Abstract
Most data mining algorithms require their input data to be provided in a very specific tabular format. Data scientists typically achieve this by writing long and complex scripts in data management languages and libraries such as SQL, R or Pandas, in which many low-level data transformation operations are performed. Writing these scripts can be time-consuming and error-prone, which decreases data scientists’ productivity. To overcome this limitation, we present Lavoisier, a declarative language for data extraction and formatting. This language provides a set of high-level constructs that allow data scientists to abstract from low-level data formatting operations. Consequently, the size and complexity of data extraction scripts are reduced, which increases productivity compared with conventional data manipulation tools.
Notes
- 2. The Features inheritance of Fig. 1 has been omitted from this initial example.
Acknowledgements
Funded by the University of Cantabria’s Doctorate Program, and by the Spanish Government under grant TIN2017-86520-C3-3-R.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
de la Vega, A., García-Saiz, D., Zorrilla, M., Sánchez, P. (2019). Lavoisier: High-Level Selection and Preparation of Data for Analysis. In: Schewe, KD., Singh, N. (eds) Model and Data Engineering. MEDI 2019. Lecture Notes in Computer Science, vol 11815. Springer, Cham. https://doi.org/10.1007/978-3-030-32065-2_4
DOI: https://doi.org/10.1007/978-3-030-32065-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32064-5
Online ISBN: 978-3-030-32065-2
eBook Packages: Computer Science, Computer Science (R0)