Abstract
Statistical analysis is a crucial component of many data science analytic pipelines, and preparing data for such analysis is a large part of the data ingestion step. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas) etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation called SDTA and embody in a language called SDTL. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4185 commands in SAS and 9087 commands in SPSS obtained from a repository. We illustrate how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C\(^2\)Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how a data transformation program could be converted to other functionally equivalent programs, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.
Similar content being viewed by others
References
Jarke, M., Koch, J.: Query optimization in database systems. ACM Comput. Surv. 16(2), 111–152 (1984). https://doi.org/10.1145/356924.356928
Ioannidis, Y.E.: Query optimization. ACM Comput. Surv. 28(1), 121–123 (1996). https://doi.org/10.1145/234313.234367
Clark, J., DeRose, S., et al.: Xml path language (xpath) version 1.0. World Wide Web Consortium (1999)
Melton, J.: Advanced SQL: 1999: understanding object-relational and other advanced features. Elsevier Science Inc., New York (2002)
Boncz, P.A., Kersten, M.L.: Mil primitives for querying a fragmented world. VLDB J. 8(2), 101–119 (1999). https://doi.org/10.1007/s007780050076
St, L., Wold, S., et al.: Analysis of variance (ANOVA). Chemom. intell. Lab. Syst. 6(4), 259–272 (1989)
Nelder, J.A., Wedderburn, R.W.: Generalized linear models. J. R. Stat. Soc.: Ser. A 135(3), 370–384 (1972)
Vardigan, M., Heus, P., Thomas, W.: Data documentation initiative: toward a standard for the social sciences. Int. J. Digit. Curation (2008). https://doi.org/10.2218/ijdc.v3i1.45
Fegraus, E.H., Andelman, S., Jones, M.B., Schildhauer, M.: Maximizing the value of ecological data with structured metadata: an introduction to ecological metadata language (eml) and principles for metadata creation. Bull. Ecol. Soc. Am. 86(3), 158–168 (2005)
Alter, G., Donakowski, D., Gager, J., Heus, P., Hunter, C., Ionescu, S., Iverson, J., Jagadish, H., Lagoze, C., Lyle, J., et al.: Provenance metadata for statistical data: an introduction to structured data transformation language (SDTL). IASSIST Q. (2020). https://doi.org/10.29173/iq983
Glavic, B., Dittrich, K.R.: Data provenance: a categorization of existing approaches. In: Datenbanksysteme in buisness, technologie und web, vol. 7, pp. 227–241 (2007)
Song, J., Alter, G., Jagadish, H.V.: C2metadata: automating the capture of data transformations from statistical scripts in data documentation. In: Proceedings of the 2019 international conference on management of data, pp. 2005–2008. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3299869.3320241
Parr, T.J., Quong, R.W.: ANTLR: a predicated-ll (k) parser generator. Software: Pract. Exp. 25(7), 789–810 (1995)
Simmen, D., Shekita, E., Malkemus, T.: Fundamental techniques for order optimization. In: The 1996 ACM SIGMOD international conference, pp. 57–67. ACM Press, New York (1996)
Slivinskas, G., Jensen, C.S., Snodgrass, R.T.: Bringing order to query optimization. SIGMOD Rec. 31(2), 5–14 (2002). https://doi.org/10.1145/565117.565119
Lerner, A., Shasha, D.: Aquery: Query language for ordered data, optimization techniques, and experiments. In: Proceedings of the 29th international conference on very large data bases, vol. 29, pp. 345–356. VLDB Endowment, Berlin (2003)
Shoshani, A.: Olap and statistical databases: similarities and differences. In: Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, pp. 185–196 (1997). https://doi.org/10.1145/263661.263682
Shoshani, A.: Statistical databases: characteristics, problems, and some solutions. In: Proceedings of the 8th international conference on very large data bases, pp. 208–222 (1982)
Özsoyoğlu, G., Özsoyoğlu, Z.M., Matos, V.: Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. ACM Trans. Database Syst. 12(4), 566–592 (1987). https://doi.org/10.1145/32204.32219
Baumann, P., Dehmel, A., Furtado, P., Ritsch, R., Widmann, N.: The multidimensional database system rasdaman. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data, pp. 575–577 (1998)
Cornacchia, R., Héman, S., Zukowski, M., Vries, A.P., Boncz, P.: Flexible and efficient IR using array databases. VLDB J. 17(1), 151–168 (2008). https://doi.org/10.1007/s00778-007-0071-0
Brown, P.G.: Overview of scidb: large scale array storage, processing and analysis. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp. 963–968. Association for Computing Machinery, New York (2010). https://doi.org/10.1145/1807167.1807271
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., Moore, R.: Google earth engine: planetary-scale geospatial analysis for everyone. Remote Sens. Environ. 202, 18–27 (2017). https://doi.org/10.1016/j.rse.2017.06.031
Ramsey, P., Columbia, V.-B. (2005) Introduction to postgis. Refractions Research Inc, Victoria, pp 34–35
Shekhar S., Xiong H.: Oracle Spatial GeoRaster. In: Shekhar, S., Xiong, H. (eds.) Encyclopedia of GIS. Springer, Boston, MA (2008). https://doi.org/10.1007/978-0-387-35973-1_937
Libkin, L., Machlin, R., Wong, L.: A query language for multidimensional arrays: design, implementation, and optimization techniques. SIGMOD Rec. 25(2), 228–239 (1996). https://doi.org/10.1145/235968.233335
Maier, D., Vance, B.: A call to order. In: Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, pp. 1–16. Association for Computing Machinery, New York (1993). https://doi.org/10.1145/153850.153851
Baumann, P.: Management of multidimensional discrete data. VLDB J. 3(4), 401–444 (1994)
Marathe, A.P., Salem, K.: A language for manipulating arrays. In: Proceedings of the 23rd international conference on very large data bases, pp. 46–55. Morgan Kaufmann Publishers Inc., San Francisco (1997)
Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970). https://doi.org/10.1145/362384.362685
Lorentzos, N.A., Johnson, R.G.: Extending relational algebra to manipulate temporal data. Inf. Syst. 13(3), 289–296 (1988)
Roth, M.A., Korth, H.F., Silberschatz, A.: Extended algebra and calculus for nested relational databases. ACM Trans. Database Syst. (TODS) 13(4), 389–417 (1988). https://doi.org/10.1145/49346.49347
Özsoyoğlu, G., Özsoyoğlu, Z., Matos, V.: Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. ACM Trans. Database Syst. (TODS) 12(4), 566–592 (1987)
Baumann, P., Holsten, S.: A comparative analysis of array models for databases. Int. J. Database Theory Appl. 5(1), 89–120 (2012)
Wyss, C.M., Robertson, E.L.: Relational languages for metadata integration. ACM Trans. Database Syst. (TODS) 30(2), 624–660 (2005)
Lakshmanan, L.V.S., Sadri, F., Subramanian, I.N.: Schemasql—a language for interoperability in relational multi-database systems, pp. 239–250. Morgan Kaufmann Publishers Inc., San Francisco (1996)
Gottlob, G., Zicari, R.: Closed world databases opened through null values. In: Proceedings of the 14th international conference on very large data bases, pp. 50–61. Morgan Kaufmann Publishers Inc., San Francisco (1988)
Zaniolo, C.: Database relations with null values. In: Proceedings of the 1st ACM SIGACT-SIGMOD symposium on principles of database systems, pp. 27–33. Association for Computing Machinery, New York (1982). https://doi.org/10.1145/588111.588117
Raman, V., Hellerstein, J.M.: Potter’s wheel: a interactive data cleaning system. In: Proceedings of the 27th international conference on very large data bases, pp. 381–390. Morgan Kaufmann Publishers Inc., San Francisco (2001)
Kandel, S., Paepcke, A., Hellerstein, J., Heer, J.: Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI conference on human factors in computing systems, pp. 3363–3372. Association for Computing Machinery, New York (2011). https://doi.org/10.1145/1978942.1979444
Jin, Z., Anderson, M.R., Cafarella, M., Jagadish, H.V.: Foofah: Transforming data by example. In: Proceedings of the 2017 ACM international conference on management of data, pp. 683–698. Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3035918.3064034
Wickham, H., François, R., Henry, L., Müller, K.: Dplyr: a grammar of data manipulation. R package version 0.7.6. (2018). https://CRAN.R-project.org/package=dplyr Accessed 16 May 2022
Wickham, H.: Tidyverse: Easily Install and Load the ’Tidyverse’. R package version 1.2.1. (2017).https://CRAN.R-project.org/package=tidyverse. Accessed 16 May 2022
Wickham, H.: Reshaping data with the reshape package. J. Stat. Softw. 21(12), 1–20 (2007)
Circle Systems, I.: Stat/transfer (version 14) (2015). https://stattransfer.com Accessed 16 May 2022
Inc., S.I.: Calling functions in the R language (2016). https://support.sas.com/ rnd/app/studio/statr.pdf Accessed 16 May 2022
Dalzell, C.: Calling R from SPSS (2013). https://developer.ibm.com/tutorials/ba-call-r-spss/ Accessed 16 May 2022
Haghish, E.: Seamless interactive language interfacing between r and stata. Stata J. 19(1), 61–82 (2019)
SDMX: validation and transformation language (VTL) (2018). https://sdmx.org/?page_id=5096 Accessed 16 May 2022
Acknowledgements
This research is supported in part by the National Science Foundation through Grant ACI-1640575 and IIS-1250880. We also thank our colleagues at Colectica, ICPSR, Metadata Technology North America and Norwegian Centre for Research Data who participated in the specification of SDTL and creation of the applications in the C\(^2\)Metadata workflow.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Song, J., Alter, G. & Jagadish, H.V. Structured data transformation algebra (SDTA) and its applications. Distrib Parallel Databases 40, 373–408 (2022). https://doi.org/10.1007/s10619-022-07418-6
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10619-022-07418-6