Lavoisier: High-Level Selection and Preparation of Data for Analysis

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11815)

Abstract

Most data mining algorithms require their input data to be provided in a very specific tabular format. Data scientists typically produce this format by writing long and complex scripts in data management languages and libraries such as SQL, R or Pandas, in which many low-level data transformation operations are performed. Writing these scripts is time-consuming and error-prone, which decreases data scientists’ productivity. To overcome this limitation, we present Lavoisier, a declarative language for data extraction and formatting. This language provides a set of high-level constructs that allow data scientists to abstract away from low-level data formatting operations. Consequently, the size and complexity of data extraction scripts are reduced, which increases productivity compared with conventional data manipulation tools.

Notes

  1. https://www.yelp.com/dataset/challenge.

  2. The Features inheritance of Fig. 1 has been omitted from this initial example.

  3. https://github.com/alfonsodelavega/lavoisier.

  4. https://github.com/alfonsodelavega/lavoisier-evaluation.

Acknowledgements

Funded by the University of Cantabria’s Doctorate Program, and by the Spanish Government under grant TIN2017-86520-C3-3-R.

Author information

Corresponding author

Correspondence to Alfonso de la Vega.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

de la Vega, A., García-Saiz, D., Zorrilla, M., Sánchez, P. (2019). Lavoisier: High-Level Selection and Preparation of Data for Analysis. In: Schewe, KD., Singh, N. (eds) Model and Data Engineering. MEDI 2019. Lecture Notes in Computer Science, vol 11815. Springer, Cham. https://doi.org/10.1007/978-3-030-32065-2_4

  • DOI: https://doi.org/10.1007/978-3-030-32065-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32064-5

  • Online ISBN: 978-3-030-32065-2

  • eBook Packages: Computer Science, Computer Science (R0)
