ABSTRACT
Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas) and etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages.
Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformations frameworks; an algebra, Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance that maintains international standards for metadata as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository.
We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how functional-equivalent transformation programs could be converted to other functionally equivalent programs, in the same or different language, permitting code reuse and result reproducibility, We also illustrate the possibility of using of SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.
- George Alter, Darrell Donakowski, Jack Gager, Pascal Heus, Carson Hunter, Sanda Ionescu, Jeremy Iverson, HV Jagadish, Carl Lagoze, Jared Lyle, 2020. Provenance metadata for statistical data: An introduction to Structured Data Transformation Language (SDTL). IASSIST Quarterly 44, 4 (2020).Google Scholar
- Peter Baumann. 1994. Management of multidimensional discrete data. The VLDB Journal 3, 4 (Oct. 1994), 401–444.Google ScholarCross Ref
- P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. 1998. The Multidimensional Database System RasDaMan. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (Seattle, Washington, USA) (SIGMOD ’98). Association for Computing Machinery, New York, NY, USA, 575–577. https://doi.org/10.1145/276304.276386Google ScholarDigital Library
- Peter Baumann and Sönke Holsten. 2011. A Comparative Analysis of Array Models for Databases.FGIT-DTA/BSBT (2011).Google Scholar
- Peter Baumann and Sönke Holsten. 2011. A comparative analysis of array models for databases. In Database theory and application, bio-science and bio-technology. Springer, 80–89.Google Scholar
- Peter A Boncz and Martin L Kersten. 1999. MIL primitives for querying a fragmented world. The VLDB Journal 8, 2 (1999), 101–119.Google ScholarDigital Library
- Inc Circle Systems. 2015. Stat/Transfer (version 14). https://stattransfer.comGoogle Scholar
- James Clark, Steve DeRose, 1999. XML path language (XPath).Google Scholar
- Edgar F Codd. 2002. A relational model of data for large shared data banks. In Software pioneers. Springer, 263–294.Google Scholar
- Colectica. 2017. Convention-based Ontology Generation System (COGS) 1.0. http://cogsdata.org/docs/.Google Scholar
- Roberto Cornacchia, Sándor Héman, Marcin Zukowski, Arjen P de Vries, and Peter A Boncz. 2008. Flexible and efficient IR using array databases.VLDB J. (2008).Google Scholar
- Catherine Dalzell. 2013. Calling R from SPSS. https://developer.ibm.com/tutorials/ba-call-r-spss/Google Scholar
- Boris Glavic and Klaus R Dittrich. 2007. Data Provenance: A Categorization of Existing Approaches.. In BTW, Vol. 7. 227–241.Google Scholar
- Noel Gorelick, Matt Hancher, Mike Dixon, Simon Ilyushchenko, David Thau, and Rebecca Moore. 2017. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment(2017). https://doi.org/10.1016/j.rse.2017.06.031Google Scholar
- Georg Gottlob and Roberto Zicari. 1988. Closed World Databases Opened Through Null Values.VLDB (1988).Google Scholar
- EF Haghish. 2019. Seamless interactive language interfacing between R and Stata. The Stata Journal 19, 1 (2019), 61–82.Google ScholarCross Ref
- SAS Institute Inc.2016. Calling Functions in the R Language. https://support.sas.com/ rnd/app/studio/statr.pdfGoogle Scholar
- Zhongjun Jin, Michael R Anderson, Michael J Cafarella, and H V Jagadish. 2017. Foofah - Transforming Data By Example.SIGMOD Conference (2017).Google Scholar
- Sean Kandel, Andreas Paepcke, Joseph M Hellerstein, and Jeffrey Heer. 2011. Wrangler - interactive visual specification of data transformation scripts.CHI (2011).Google Scholar
- Laks VS Lakshmanan, Fereidoon Sadri, and Iyer N Subramanian. 1996. SchemaSQL-a language for interoperability in relational multi-database systems. In VLDB, Vol. 96. Citeseer, 239–250.Google Scholar
- Alberto Lerner and Dennis E Shasha. 2003. AQuery - Query Language for Ordered Data, Optimization Techniques, and Experiments.VLDB (2003).Google Scholar
- Leonid Libkin, Rona Machlin, and Limsoon Wong. 1996. A Query Language for Multidimensional Arrays - Design, Implementation, and Optimization Techniques.SIGMOD Conference (1996).Google Scholar
- Nikos A Lorentzos and Roger G Johnson. 1988. Extending relational algebra to manipulate temporal data. Information Systems 13, 3 (1988), 289–296.Google ScholarDigital Library
- Arunprasad P Marathe and Kenneth Salem. 1997. A Language for Manipulating Arrays.VLDB (1997).Google Scholar
- Jim Melton. 2003. Advanced SQL: 1999: Understanding object-relational and other advanced features. Morgan Kaufmann.Google Scholar
- Chuck Murray, Janet Blowney, J Xie, T Xu, and S Yuditskaya. 2003. Oracle Spatial GeoRaster, 10 g Release 1. Oracle Corporation (2003), 2–31.Google Scholar
- G Özsoyoğlu, ZM Özsoyoğlu, and Victor Matos. 1987. Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. ACM Transactions on Database Systems (TODS) 12, 4 (1987), 566–592.Google ScholarDigital Library
- Gultekin Özsoyoglu and Z Meral Özsoyoglu. 1985. Statistical Database Query Languages.IEEE Trans. Software Eng.(1985).Google ScholarCross Ref
- Vijayshankar Raman and Joseph M Hellerstein. 2001. Potter’s Wheel - An Interactive Data Cleaning System.VLDB (2001).Google Scholar
- P Ramsey, VB Columbia Refractions Research Inc, and 2005. [n.d.]. Introduction to PostGIS. drm.cenn.ge ([n. d.]).Google Scholar
- Mark A Roth, Herry F Korth, and Abraham Silberschatz. 1988. Extended algebra and calculus for nested relational databases. ACM Transactions on Database Systems (TODS) 13, 4 (1988), 389–417.Google ScholarDigital Library
- SDMX. 2018. Validation and Transformation Language (VTL). https://sdmx.org/?page_id=5096Google Scholar
- Arie Shoshani. 1982. Statistical Databases - Characteristics, Problems, and some Solutions.VLDB (1982).Google Scholar
- Arie Shoshani. 1997. OLAP and Statistical Databases - Similarities and Differences.PODS (1997).Google Scholar
- David Simmen, Eugene Shekita, and Timothy Malkemus. 1996. Fundamental techniques for order optimization. In the 1996 ACM SIGMOD international conference. ACM Press, New York, New York, USA, 57–67.Google ScholarDigital Library
- SlivinskasGiedrius, JensenChristian S, and SnodgrassRichard Thomas. 2002. Bringing order to query optimization. ACM SIGMOD Record (June 2002).Google Scholar
- Jie Song, George Alter, and H. V. Jagadish. 2019. C2Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation. Proceedings of the 2019 International Conference on Management of Data (2019).Google ScholarDigital Library
- Michael Stonebraker, Paul Brown, Alex Poliakov, and Suchi Raman. 2011. The Architecture of SciDB.SSDBM (2011).Google Scholar
- Alex R van Ballegooij. 2004. RAM: A Multidimensional Array DBMS. In Current Trends in Database Technology - EDBT 2004 Workshops. Springer, Berlin, Heidelberg, Berlin, Heidelberg, 154–165.Google Scholar
- Hadley Wickham. 2007. Reshaping Data with the reshape Package. Journal of Statistical Software 21, 12 (2007), 1–20. http://www.jstatsoft.org/v21/i12/Google ScholarCross Ref
- Hadley Wickham. 2017. tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse R package version 1.2.1.Google Scholar
- Hadley Wickham, Romain François, Lionel Henry, and Kirill MÃŒller. 2018. dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr R package version 0.7.6.Google Scholar
- Catharine M Wyss and Edward L Robertson. 2005. Relational languages for metadata integration. ACM Transactions on Database Systems (TODS) 30, 2 (2005), 624–660.Google ScholarDigital Library
- Carlo Zaniolo. 1984. Database Relations with Null Values.J. Comput. Syst. Sci.(1984).Google Scholar
Recommendations
Structured data transformation algebra (SDTA) and its applications
AbstractStatistical analysis is a crucial component of many data science analytic pipelines, and preparing data for such analysis is a large part of the data ingestion step. This task is generally accomplished by writing transformation scripts in ...
Statistical relational tables for statistical database management
This paper extends Codd's relational view to represent statistical data and to achieve the efficient analysis of statistical data. It discusses why the relational calculus has not been popular with statisticians. A new view called a statistical ...
C2Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation
SIGMOD '19: Proceedings of the 2019 International Conference on Management of DataDatasets are often derived by manipulating raw data with statistical software packages. The derivation of a dataset must be recorded in terms of both the raw input and the manipulations applied to it. Statistics packages typically provide limited help ...
Comments