research-article

Public Access

SDTA: An Algebra for Statistical Data Transformation

Authors:
Jie Song

University of Michigan, Ann Arbor

University of Michigan, Ann Arbor
View Profile

,
H. V. Jagadish

University of Michigan, Ann Arbor

University of Michigan, Ann Arbor
View Profile

,
George Alter

University of Michigan, Ann Arbor

University of Michigan, Ann Arbor
View Profile

SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database ManagementJuly 2021Pages 109–120https://doi.org/10.1145/3468791.3468811

Published:11 August 2021Publication History

SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database Management

Pages 109–120

ABSTRACT

Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas) and etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages.

Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformations frameworks; an algebra, Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance that maintains international standards for metadata as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository.

We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how functional-equivalent transformation programs could be converted to other functionally equivalent programs, in the same or different language, permitting code reuse and result reproducibility, We also illustrate the possibility of using of SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.

References

George Alter, Darrell Donakowski, Jack Gager, Pascal Heus, Carson Hunter, Sanda Ionescu, Jeremy Iverson, HV Jagadish, Carl Lagoze, Jared Lyle, 2020. Provenance metadata for statistical data: An introduction to Structured Data Transformation Language (SDTL). IASSIST Quarterly 44, 4 (2020).Google Scholar
Peter Baumann. 1994. Management of multidimensional discrete data. The VLDB Journal 3, 4 (Oct. 1994), 401–444.Google ScholarCross Ref
P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. 1998. The Multidimensional Database System RasDaMan. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (Seattle, Washington, USA) (SIGMOD ’98). Association for Computing Machinery, New York, NY, USA, 575–577. https://doi.org/10.1145/276304.276386Google ScholarDigital Library
Peter Baumann and Sönke Holsten. 2011. A Comparative Analysis of Array Models for Databases.FGIT-DTA/BSBT (2011).Google Scholar
Peter Baumann and Sönke Holsten. 2011. A comparative analysis of array models for databases. In Database theory and application, bio-science and bio-technology. Springer, 80–89.Google Scholar
Peter A Boncz and Martin L Kersten. 1999. MIL primitives for querying a fragmented world. The VLDB Journal 8, 2 (1999), 101–119.Google ScholarDigital Library
Inc Circle Systems. 2015. Stat/Transfer (version 14). https://stattransfer.comGoogle Scholar
James Clark, Steve DeRose, 1999. XML path language (XPath).Google Scholar
Edgar F Codd. 2002. A relational model of data for large shared data banks. In Software pioneers. Springer, 263–294.Google Scholar
Colectica. 2017. Convention-based Ontology Generation System (COGS) 1.0. http://cogsdata.org/docs/.Google Scholar
Roberto Cornacchia, Sándor Héman, Marcin Zukowski, Arjen P de Vries, and Peter A Boncz. 2008. Flexible and efficient IR using array databases.VLDB J. (2008).Google Scholar
Catherine Dalzell. 2013. Calling R from SPSS. https://developer.ibm.com/tutorials/ba-call-r-spss/Google Scholar
Boris Glavic and Klaus R Dittrich. 2007. Data Provenance: A Categorization of Existing Approaches.. In BTW, Vol. 7. 227–241.Google Scholar
Noel Gorelick, Matt Hancher, Mike Dixon, Simon Ilyushchenko, David Thau, and Rebecca Moore. 2017. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment(2017). https://doi.org/10.1016/j.rse.2017.06.031Google Scholar
Georg Gottlob and Roberto Zicari. 1988. Closed World Databases Opened Through Null Values.VLDB (1988).Google Scholar
EF Haghish. 2019. Seamless interactive language interfacing between R and Stata. The Stata Journal 19, 1 (2019), 61–82.Google ScholarCross Ref
SAS Institute Inc.2016. Calling Functions in the R Language. https://support.sas.com/ rnd/app/studio/statr.pdfGoogle Scholar
Zhongjun Jin, Michael R Anderson, Michael J Cafarella, and H V Jagadish. 2017. Foofah - Transforming Data By Example.SIGMOD Conference (2017).Google Scholar
Sean Kandel, Andreas Paepcke, Joseph M Hellerstein, and Jeffrey Heer. 2011. Wrangler - interactive visual specification of data transformation scripts.CHI (2011).Google Scholar
Laks VS Lakshmanan, Fereidoon Sadri, and Iyer N Subramanian. 1996. SchemaSQL-a language for interoperability in relational multi-database systems. In VLDB, Vol. 96. Citeseer, 239–250.Google Scholar
Alberto Lerner and Dennis E Shasha. 2003. AQuery - Query Language for Ordered Data, Optimization Techniques, and Experiments.VLDB (2003).Google Scholar
Leonid Libkin, Rona Machlin, and Limsoon Wong. 1996. A Query Language for Multidimensional Arrays - Design, Implementation, and Optimization Techniques.SIGMOD Conference (1996).Google Scholar
Nikos A Lorentzos and Roger G Johnson. 1988. Extending relational algebra to manipulate temporal data. Information Systems 13, 3 (1988), 289–296.Google ScholarDigital Library
Arunprasad P Marathe and Kenneth Salem. 1997. A Language for Manipulating Arrays.VLDB (1997).Google Scholar
Jim Melton. 2003. Advanced SQL: 1999: Understanding object-relational and other advanced features. Morgan Kaufmann.Google Scholar
Chuck Murray, Janet Blowney, J Xie, T Xu, and S Yuditskaya. 2003. Oracle Spatial GeoRaster, 10 g Release 1. Oracle Corporation (2003), 2–31.Google Scholar
G Özsoyoğlu, ZM Özsoyoğlu, and Victor Matos. 1987. Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. ACM Transactions on Database Systems (TODS) 12, 4 (1987), 566–592.Google ScholarDigital Library
Gultekin Özsoyoglu and Z Meral Özsoyoglu. 1985. Statistical Database Query Languages.IEEE Trans. Software Eng.(1985).Google ScholarCross Ref
Vijayshankar Raman and Joseph M Hellerstein. 2001. Potter’s Wheel - An Interactive Data Cleaning System.VLDB (2001).Google Scholar
P Ramsey, VB Columbia Refractions Research Inc, and 2005. [n.d.]. Introduction to PostGIS. drm.cenn.ge ([n. d.]).Google Scholar
Mark A Roth, Herry F Korth, and Abraham Silberschatz. 1988. Extended algebra and calculus for nested relational databases. ACM Transactions on Database Systems (TODS) 13, 4 (1988), 389–417.Google ScholarDigital Library
SDMX. 2018. Validation and Transformation Language (VTL). https://sdmx.org/?page_id=5096Google Scholar
Arie Shoshani. 1982. Statistical Databases - Characteristics, Problems, and some Solutions.VLDB (1982).Google Scholar
Arie Shoshani. 1997. OLAP and Statistical Databases - Similarities and Differences.PODS (1997).Google Scholar
David Simmen, Eugene Shekita, and Timothy Malkemus. 1996. Fundamental techniques for order optimization. In the 1996 ACM SIGMOD international conference. ACM Press, New York, New York, USA, 57–67.Google ScholarDigital Library
SlivinskasGiedrius, JensenChristian S, and SnodgrassRichard Thomas. 2002. Bringing order to query optimization. ACM SIGMOD Record (June 2002).Google Scholar
Jie Song, George Alter, and H. V. Jagadish. 2019. C2Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation. Proceedings of the 2019 International Conference on Management of Data (2019).Google ScholarDigital Library
Michael Stonebraker, Paul Brown, Alex Poliakov, and Suchi Raman. 2011. The Architecture of SciDB.SSDBM (2011).Google Scholar
Alex R van Ballegooij. 2004. RAM: A Multidimensional Array DBMS. In Current Trends in Database Technology - EDBT 2004 Workshops. Springer, Berlin, Heidelberg, Berlin, Heidelberg, 154–165.Google Scholar
Hadley Wickham. 2007. Reshaping Data with the reshape Package. Journal of Statistical Software 21, 12 (2007), 1–20. http://www.jstatsoft.org/v21/i12/Google ScholarCross Ref
Hadley Wickham. 2017. tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse R package version 1.2.1.Google Scholar
Hadley Wickham, Romain François, Lionel Henry, and Kirill MÃŒller. 2018. dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr R package version 0.7.6.Google Scholar
Catharine M Wyss and Edward L Robertson. 2005. Relational languages for metadata integration. ACM Transactions on Database Systems (TODS) 30, 2 (2005), 624–660.Google ScholarDigital Library
Carlo Zaniolo. 1984. Database Relations with Null Values.J. Comput. Syst. Sci.(1984).Google Scholar

Recommendations

Structured data transformation algebra (SDTA) and its applications
Abstract
Statistical analysis is a crucial component of many data science analytic pipelines, and preparing data for such analysis is a large part of the data ingestion step. This task is generally accomplished by writing transformation scripts in ... $^{}$
Read More
Statistical relational tables for statistical database management

This paper extends Codd's relational view to represent statistical data and to achieve the efficient analysis of statistical data. It discusses why the relational calculus has not been popular with statisticians. A new view called a statistical ...
Read More
C²Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

Datasets are often derived by manipulating raw data with statistical software packages. The derivation of a dataset must be recorded in terms of both the raw input and the manipulations applied to it. Statistics packages typically provide limited help ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database Management
July 2021
275 pages
ISBN:9781450384131
DOI:10.1145/3468791
Editors:
Qiang Zhu
University of Michigan - Dearborn, USA
,
Xingquan (Hill) Zhu
Florida Atlantic University, USA
,
Yicheng Tu
University of South Florida, USA
,
Zichen (Frank) Xu
Nanchang University, China
,
Anand Kumar
Amazon Inc., USA
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 11 August 2021
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
data documentation
data transformation algebra
statistical data
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
Overall Acceptance Rate56of146submissions,38%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 258
  Total Downloads
- Downloads (Last 12 months)154
- Downloads (Last 6 weeks)34
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format

SDTA: An Algebra for Statistical Data Transformation

SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database Management

ABSTRACT

References

Cited By

Recommendations

Structured data transformation algebra (SDTA) and its applications

Statistical relational tables for statistical database management

C²Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

HTML Format

Caption

SDTA: An Algebra for Statistical Data Transformation

SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database Management

ABSTRACT

References

Cited By

Recommendations

Structured data transformation algebra (SDTA) and its applications

Statistical relational tables for statistical database management

C2Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

HTML Format

Share this Publication link

Share on Social Media

C²Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation