Skip to main content
Log in

Structured data transformation algebra (SDTA) and its applications

  • Published:
Distributed and Parallel Databases Aims and scope Submit manuscript

Abstract

Statistical analysis is a crucial component of many data science analytic pipelines, and preparing data for such analysis is a large part of the data ingestion step. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas) etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation called SDTA and embody in a language called SDTL. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4185 commands in SAS and 9087 commands in SPSS obtained from a repository. We illustrate how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C\(^2\)Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how a data transformation program could be converted to other functionally equivalent programs, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14

Similar content being viewed by others

Notes

  1. https://ddialliance.org/products/sdtl/1.0.

References

  1. Jarke, M., Koch, J.: Query optimization in database systems. ACM Comput. Surv. 16(2), 111–152 (1984). https://doi.org/10.1145/356924.356928

    Article  MathSciNet  MATH  Google Scholar 

  2. Ioannidis, Y.E.: Query optimization. ACM Comput. Surv. 28(1), 121–123 (1996). https://doi.org/10.1145/234313.234367

    Article  Google Scholar 

  3. Clark, J., DeRose, S., et al.: Xml path language (xpath) version 1.0. World Wide Web Consortium (1999)

  4. Melton, J.: Advanced SQL: 1999: understanding object-relational and other advanced features. Elsevier Science Inc., New York (2002)

    Google Scholar 

  5. Boncz, P.A., Kersten, M.L.: Mil primitives for querying a fragmented world. VLDB J. 8(2), 101–119 (1999). https://doi.org/10.1007/s007780050076

    Article  Google Scholar 

  6. St, L., Wold, S., et al.: Analysis of variance (ANOVA). Chemom. intell. Lab. Syst. 6(4), 259–272 (1989)

    Article  Google Scholar 

  7. Nelder, J.A., Wedderburn, R.W.: Generalized linear models. J. R. Stat. Soc.: Ser. A 135(3), 370–384 (1972)

    Google Scholar 

  8. Vardigan, M., Heus, P., Thomas, W.: Data documentation initiative: toward a standard for the social sciences. Int. J. Digit. Curation (2008). https://doi.org/10.2218/ijdc.v3i1.45

    Article  Google Scholar 

  9. Fegraus, E.H., Andelman, S., Jones, M.B., Schildhauer, M.: Maximizing the value of ecological data with structured metadata: an introduction to ecological metadata language (eml) and principles for metadata creation. Bull. Ecol. Soc. Am. 86(3), 158–168 (2005)

    Article  Google Scholar 

  10. Alter, G., Donakowski, D., Gager, J., Heus, P., Hunter, C., Ionescu, S., Iverson, J., Jagadish, H., Lagoze, C., Lyle, J., et al.: Provenance metadata for statistical data: an introduction to structured data transformation language (SDTL). IASSIST Q. (2020). https://doi.org/10.29173/iq983

    Article  Google Scholar 

  11. Glavic, B., Dittrich, K.R.: Data provenance: a categorization of existing approaches. In: Datenbanksysteme in buisness, technologie und web, vol. 7, pp. 227–241 (2007)

  12. Song, J., Alter, G., Jagadish, H.V.: C2metadata: automating the capture of data transformations from statistical scripts in data documentation. In: Proceedings of the 2019 international conference on management of data, pp. 2005–2008. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3299869.3320241

  13. Parr, T.J., Quong, R.W.: ANTLR: a predicated-ll (k) parser generator. Software: Pract. Exp. 25(7), 789–810 (1995)

    Google Scholar 

  14. Simmen, D., Shekita, E., Malkemus, T.: Fundamental techniques for order optimization. In: The 1996 ACM SIGMOD international conference, pp. 57–67. ACM Press, New York (1996)

  15. Slivinskas, G., Jensen, C.S., Snodgrass, R.T.: Bringing order to query optimization. SIGMOD Rec. 31(2), 5–14 (2002). https://doi.org/10.1145/565117.565119

    Article  Google Scholar 

  16. Lerner, A., Shasha, D.: Aquery: Query language for ordered data, optimization techniques, and experiments. In: Proceedings of the 29th international conference on very large data bases, vol. 29, pp. 345–356. VLDB Endowment, Berlin (2003)

  17. Shoshani, A.: Olap and statistical databases: similarities and differences. In: Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, pp. 185–196 (1997). https://doi.org/10.1145/263661.263682

  18. Shoshani, A.: Statistical databases: characteristics, problems, and some solutions. In: Proceedings of the 8th international conference on very large data bases, pp. 208–222 (1982)

  19. Özsoyoğlu, G., Özsoyoğlu, Z.M., Matos, V.: Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. ACM Trans. Database Syst. 12(4), 566–592 (1987). https://doi.org/10.1145/32204.32219

    Article  MathSciNet  Google Scholar 

  20. Baumann, P., Dehmel, A., Furtado, P., Ritsch, R., Widmann, N.: The multidimensional database system rasdaman. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data, pp. 575–577 (1998)

  21. Cornacchia, R., Héman, S., Zukowski, M., Vries, A.P., Boncz, P.: Flexible and efficient IR using array databases. VLDB J. 17(1), 151–168 (2008). https://doi.org/10.1007/s00778-007-0071-0

    Article  Google Scholar 

  22. Brown, P.G.: Overview of scidb: large scale array storage, processing and analysis. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp. 963–968. Association for Computing Machinery, New York (2010). https://doi.org/10.1145/1807167.1807271

  23. Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., Moore, R.: Google earth engine: planetary-scale geospatial analysis for everyone. Remote Sens. Environ. 202, 18–27 (2017). https://doi.org/10.1016/j.rse.2017.06.031

    Article  Google Scholar 

  24. Ramsey, P., Columbia, V.-B. (2005) Introduction to postgis. Refractions Research Inc, Victoria, pp 34–35

  25. Shekhar S., Xiong H.: Oracle Spatial GeoRaster. In: Shekhar, S., Xiong, H. (eds.) Encyclopedia of GIS. Springer, Boston, MA (2008). https://doi.org/10.1007/978-0-387-35973-1_937

  26. Libkin, L., Machlin, R., Wong, L.: A query language for multidimensional arrays: design, implementation, and optimization techniques. SIGMOD Rec. 25(2), 228–239 (1996). https://doi.org/10.1145/235968.233335

    Article  Google Scholar 

  27. Maier, D., Vance, B.: A call to order. In: Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, pp. 1–16. Association for Computing Machinery, New York (1993). https://doi.org/10.1145/153850.153851

  28. Baumann, P.: Management of multidimensional discrete data. VLDB J. 3(4), 401–444 (1994)

    Article  Google Scholar 

  29. Marathe, A.P., Salem, K.: A language for manipulating arrays. In: Proceedings of the 23rd international conference on very large data bases, pp. 46–55. Morgan Kaufmann Publishers Inc., San Francisco (1997)

  30. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970). https://doi.org/10.1145/362384.362685

    Article  MATH  Google Scholar 

  31. Lorentzos, N.A., Johnson, R.G.: Extending relational algebra to manipulate temporal data. Inf. Syst. 13(3), 289–296 (1988)

    Article  Google Scholar 

  32. Roth, M.A., Korth, H.F., Silberschatz, A.: Extended algebra and calculus for nested relational databases. ACM Trans. Database Syst. (TODS) 13(4), 389–417 (1988). https://doi.org/10.1145/49346.49347

    Article  MathSciNet  MATH  Google Scholar 

  33. Özsoyoğlu, G., Özsoyoğlu, Z., Matos, V.: Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. ACM Trans. Database Syst. (TODS) 12(4), 566–592 (1987)

    Article  MathSciNet  Google Scholar 

  34. Baumann, P., Holsten, S.: A comparative analysis of array models for databases. Int. J. Database Theory Appl. 5(1), 89–120 (2012)

    Google Scholar 

  35. Wyss, C.M., Robertson, E.L.: Relational languages for metadata integration. ACM Trans. Database Syst. (TODS) 30(2), 624–660 (2005)

    Article  Google Scholar 

  36. Lakshmanan, L.V.S., Sadri, F., Subramanian, I.N.: Schemasql—a language for interoperability in relational multi-database systems, pp. 239–250. Morgan Kaufmann Publishers Inc., San Francisco (1996)

  37. Gottlob, G., Zicari, R.: Closed world databases opened through null values. In: Proceedings of the 14th international conference on very large data bases, pp. 50–61. Morgan Kaufmann Publishers Inc., San Francisco (1988)

  38. Zaniolo, C.: Database relations with null values. In: Proceedings of the 1st ACM SIGACT-SIGMOD symposium on principles of database systems, pp. 27–33. Association for Computing Machinery, New York (1982). https://doi.org/10.1145/588111.588117

  39. Raman, V., Hellerstein, J.M.: Potter’s wheel: a interactive data cleaning system. In: Proceedings of the 27th international conference on very large data bases, pp. 381–390. Morgan Kaufmann Publishers Inc., San Francisco (2001)

  40. Kandel, S., Paepcke, A., Hellerstein, J., Heer, J.: Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI conference on human factors in computing systems, pp. 3363–3372. Association for Computing Machinery, New York (2011). https://doi.org/10.1145/1978942.1979444

  41. Jin, Z., Anderson, M.R., Cafarella, M., Jagadish, H.V.: Foofah: Transforming data by example. In: Proceedings of the 2017 ACM international conference on management of data, pp. 683–698. Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3035918.3064034

  42. Wickham, H., François, R., Henry, L., Müller, K.: Dplyr: a grammar of data manipulation. R package version 0.7.6. (2018). https://CRAN.R-project.org/package=dplyr Accessed 16 May 2022

  43. Wickham, H.: Tidyverse: Easily Install and Load the ’Tidyverse’. R package version 1.2.1. (2017).https://CRAN.R-project.org/package=tidyverse. Accessed 16 May 2022

  44. Wickham, H.: Reshaping data with the reshape package. J. Stat. Softw. 21(12), 1–20 (2007)

    Article  Google Scholar 

  45. Circle Systems, I.: Stat/transfer (version 14) (2015). https://stattransfer.com Accessed 16 May 2022

  46. Inc., S.I.: Calling functions in the R language (2016). https://support.sas.com/ rnd/app/studio/statr.pdf Accessed 16 May 2022

  47. Dalzell, C.: Calling R from SPSS (2013). https://developer.ibm.com/tutorials/ba-call-r-spss/ Accessed 16 May 2022

  48. Haghish, E.: Seamless interactive language interfacing between r and stata. Stata J. 19(1), 61–82 (2019)

    Article  Google Scholar 

  49. SDMX: validation and transformation language (VTL) (2018). https://sdmx.org/?page_id=5096 Accessed 16 May 2022

Download references

Acknowledgements

This research is supported in part by the National Science Foundation through Grant ACI-1640575 and IIS-1250880. We also thank our colleagues at Colectica, ICPSR, Metadata Technology North America and Norwegian Centre for Research Data who participated in the specification of SDTL and creation of the applications in the C\(^2\)Metadata workflow.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jie Song.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Song, J., Alter, G. & Jagadish, H.V. Structured data transformation algebra (SDTA) and its applications. Distrib Parallel Databases 40, 373–408 (2022). https://doi.org/10.1007/s10619-022-07418-6

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10619-022-07418-6

Keywords

Navigation